



































































Open new doors at Boeing 


Some of the most respected scientific minds in the country are opening
new doors to significant advancements in electronics.

If you’re qualified, you could join them at the new Boeing Electronics 
High Technology Center in Bellevue, Washington. Just minutes from the 
University of Washington and Seattle. In an attractive, creative environment
for advanced research, away from distractions.

The Center is equipped with state of the art laboratories where you 
can develop, test and perfect your ideas. 

The positions are open to candidates with advanced degrees (MS or 
Ph.D) in Electronic Engineering, Physics, Physical Chemistry, Materials 
Science, Computer Science or Applied Mathematics. 

If you believe, as we do, that the most exciting era in Electronics,
Electronics/Photonics has just begun, send your résumé, with present
and expected salary, to The Boeing Company, P.O. Box 3707-DBG,
Seattle, WA 98124. Who knows. You may be the next person to
cross the threshold to an electronics/photonics
breakthrough.

We're an equal opportunity employer.


Boeing needs world-class engineers and scientists for these specialties:

Photonics
  Fiber optics sensors
  Wideband data devices and components
  Infrared and visible spectrum sensors
  Optical communications
  Optical information processing

Radio Frequency
  Monolithic microwave integrated circuits
  Millimeter wave integrated circuit technology
  Advanced devices and circuit components
  Secure, reliable transmitters and receivers

Microelectronics
  Radiation hardened circuits and devices
  Advanced group III-V circuit technology
  Advanced microelectronics packaging and interconnects
  Advanced integrated circuits

Information Processing
  Ultrareliable computer architectures
  High performance computer architectures
  Signal and image processing
  Symbolic processing
  Machine vision
  Advanced display concepts

Materials Processing
  Bulk and thin film device fabrication
  Materials and device characterization

Independent Research
  Advanced electro-optic materials
  3D structures
  Innovative concepts











Announcing new PC SIMSCRIPT II.5
... with animation 



SIMSCRIPT II.5 with animation now on personal computers 


free trial—see how 

SIMSCRIPT II.5 helps you build a realistic model 


the complete 

SIMSCRIPT II.5 on a PC large models on your PC free trial 


SIMSCRIPT II.5 for personal
computers is the same popular
simulation language that is now widely
used on mainframes.

You can now build realistic models
of military, manufacturing, communications,
logistics, transportation or
other systems on your PC.

PC SIMSCRIPT includes a new
programming environment that makes
it easy for you to develop, verify,
modify, and enhance simulation
models on a personal computer.

natural method of modelling

When building a model in
SIMSCRIPT, you describe the simulated
system as consisting of certain
types of entities: perhaps workers,
machines and jobs in a simulated
factory; or flights and airports in a
simulated air transport system; or
jobs, processors, channels, and I/O
devices in a simulated computer
system.

For each type of entity you 
give names to the attributes that 
characterize it. 

You also name the sets an entity 
type may belong to, and the sets it 
may own. 

Since your model is English-like, 
with names that you choose, it reads 
like a description of the simulated 
system. The model can be read and 
verified by non-programmers who 
understand the system under study. 

This makes your model development,
validation and evolutionary
changes much easier.


Your model and data are not
limited by the size of the PC.
SIMSCRIPT is the only simulation
tool that automatically makes use of
the hard disk as a memory extension.

reduced cost

SIMSCRIPT II.5® is a well established,
standardized, and widely
used language with proven software
support.

Experience has shown that SIMSCRIPT II.5
reduces simulation
programming time and cost
severalfold compared to other
simulation techniques.

animated and graphical results

With PC SIMSCRIPT II.5® you 
build models that can show an 
animated picture of the system under 
study. Observing the simulation 
improves understanding of the 
system and builds confidence in the 
model. 

Because you see the operation of 
the simulated system and can easily 
try alternatives, the time and cost of 
system analysis are sharply reduced. 

computers with SIMSCRIPT II.5 

1. IBM Personal Computer AT, 
XT, PC or compatible, with a hard 
disk. 

2. Most Mainframe computer types 
including IBM, CDC, VAX, Univac, 
Prime, Gould, Data General and 
Honeywell. 


SIMSCRIPT II.5 and PC SIMSCRIPT II.5 are registered
trademarks and service marks of CACI, INC.-FEDERAL 


The free trial package contains
everything you need to try SIMSCRIPT II.5
on your own computer.

We send you PC or Mainframe
SIMSCRIPT II.5, installation instructions,
sample models, and a complete
set of documentation. You can build
your own model or modify one of
ours. No cost or obligation.

special offer free training 
For a limited time we will also 
include free training. Space is limited 
so act now to avoid disappointment. 

Call Rick Crawford at (619) 
457-9681 to reserve your place. 


free trial-learn the reasons for the broad 
and growing popularity of SIMSCRIPT
II.5—no cost or obligation 

special offer-return the coupon today 
and we will include one free course
enrollment worth $850





Computer Operating System 


Return to: ,EEEC0M 

CACI 

3344 North Torrey Pines Court 
La Jolla, California 92037 

Or, better yet, 

call Rick Crawford at (619) 457-9681 
FEATURE ARTICLES


Guest Editor's Introduction: Domesticating Parallelism

David Gelernter 

Parallel computers are becoming increasingly widespread. The question now is what to do with them. 

Can programming-language researchers make the power of parallelism accessible to programmers? 

Parallel Processing in Ada

David A. Mundie and David A. Fisher 

Ada’s tasking mechanism represents a bold attempt to allow large, complex parallel processing 
applications to be written in a portable high-level language. 

Linda and Friends

Sudhir Ahuja, Nicholas Carriero, and David Gelernter 
Linda consists of a few simple primitives that support an “uncoupled” style of parallel programming. 
Implementations exist on a broad spectrum of parallel machines. 

Parallel Symbolic Computing

Robert H. Halstead, Jr. 

Futures find parallelism in symbolic programs by allowing the manipulation of partially computed data. 

Concurrent Prolog: A Progress Report

Ehud Shapiro 

A process-oriented language, Concurrent Prolog embodies dataflow synchronization and
guarded-command indeterminacy as its basic control mechanisms.

Para-Functional Programming

Paul Hudak 

This methodology treats a multiprocessor as a single autonomous computer onto which a program is 
mapped, rather than as a group of independent processors. 

A Survey of Advanced Microprocessor and HLL Computer Architectures

A. Silbey, V. Milutinovic, and V. Mendoza-Grado 

This survey classifies high-level language computer architectures and gives case studies of representative 
examples of each class of architecture. 
University of Utah
Department of Computer Science 
Department Chairperson 


The Computer Science Department at the University of Utah 
invites applications for the tenure track position of Chairperson. 
This appointment is for a three year, renewable term. 

The Department has an internationally recognized reputation for 
excellence in research and teaching, both at the undergraduate 
and graduate levels. There are currently 15 regular faculty, and 
the full complement of faculty numbers 28. The student population
of the Department comprises approximately 35 PhD
students, 40 MS students, and 250 undergraduate majors 
selected by an annual competition among pre-majors. 

Active research areas within the Department include computer 
aided geometric design, VLSI design, information retrieval architecture,
parallel processing, computer simulation, robotics and
image processing, portable AI systems software, information
based complexity theory, computer aided instruction, functional 
and logic programming, and computer music. 

The Department maintains a superb research computing facility, 
including a DECSystem 2060 and five VAX mainframes, including 
a VAX 8600. An 18-node BBN Butterfly and over seventy HP, 
Apollo, and Sun workstations are installed. Special purpose 
equipment is available for signal processing, graphics, robotics, 
and VLSI research. 

Starting date for the appointment is July 1, 1987. Direct vita, along
with the names of three references, to: 

Gary Lindstrom, Chairman 
Chairperson Search Committee 
University of Utah 
Department of Computer Science 
Salt Lake City, Utah 84112 

CONSIDERATION OF APPLICATIONS WILL CONTINUE UNTIL 1 FEBRUARY 
1987 OR UNTIL THE POSITION IS FILLED. 
Moving?

PLEASE NOTIFY 

US 4 WEEKS 

IN ADVANCE 

Name (Please Print) 

New Address 

City State/Country 

Zip 

MAIL TO: 

IEEE Service Center 

445 Hoes Lane 

Piscataway, NJ 08854 



ATTACH 

LABEL 

HERE 


• This notice of address change will apply to all 
IEEE publications to which you subscribe. 

• List new address above. 

• If you have a question about your subscription, 
place label here and clip this form to your letter. 


IEEE COMPUTER SOCIETY 
EXECUTIVE COMMITTEE 


President: Roy L. Russo* 

IBM T. J. Watson Research Center 
Route 134 
PO Box 218 

Yorktown Heights, NY 10598 
(914) 945-3085 
Vice Presidents 

Publications (1st VP): J. T. Cain* 
Technical Activities (2nd VP): John D. Musa* 
Conferences and Tutorials: James H. Aylor 
Educational Activities: Glen G. Langdon, Jr. 
Membership and Information: Ming T. Liu 
Area Activities: H. Troy Nagle 
Standards: Helen M. Wood 


Treasurer: Joseph E. Urban 
Secretary: Fletcher J. Buckley 
Junior Past President: Martha Sloan* 

IEEE Division Directors: Martha Sloan, Ronald G. Hoelzeman 
Executive Director: T. Michael Elliott* 

*Ex officio member of Board of Governors 


BOARD OF GOVERNORS 

TERM ENDING 1986 TERM ENDING 1987 


Dennis R. Allison 
Kenneth R. Anderson 
P. Bruce Berra 
Fletcher J. Buckley 
Richard C. Jaeger 
Ming T. Liu 
Michael C. Mulder 
Hillel Ofek 
Edward W. Thomas 
Joseph E. Urban 
Oscar N. Garcia* 


Barry W. Boehm 
Paul L. Borrill 
Glen G. Langdon, Jr. 
Duncan H. Lawrie 
Susan L. Rosenbaum 
Bruce Shriver 
Harold S. Stone 
Wing N. Toy 
Helen M. Wood 
Akihiko Yamada 


PUBLICATIONS BOARD 


Dharma P. Agrawal 
Vishwani D. Agrawal 
Dennis R. Allison 
Bill D. Carroll 
Michael Evangelist 
James J. Farrell III 
Tse-yun Feng 
Lansing Hatfield 
Ronald G. Hoelzeman 
Sam Horvitz 


Richard C. Jaeger 
Willis K. King 
Duncan H. Lawrie 
Jack Lipovski 
Ming T. Liu 
Michael C. Mulder 
Theo Pavlidis 
David Pessel 


C. V. Ramamoorthy 
Bruce D. Shriver 
Steve Tanimoto 


J. T. Cain, Vice President for Publications 


SENIOR STAFF 

Executive Director: T. Michael Elliott 
IEEE Computer Society 
1730 Massachusetts Ave., NW 
Washington, DC 20036-1903 
(202) 371-0101 

Editor and Publisher: True Seaborn 
Director, Computer Society Press: Chip G. Stockton 
Director, Conferences: William R. Habingreither 
Director, Tutorials: Martez A. Camillieri 
Director, Finance: Mary Ellen Curto 


NEXT BOARD OF GOVERNORS MEETING: 
Hotel Anatole 
Dallas, Texas 

November 7, 1986, 8:30 a.m.-5 p.m. 
THE INSTITUTE OF ELECTRICAL 
AND ELECTRONICS ENGINEERS, INC. 


President: Bruno O. Weinschel 
President-Elect: Henry L. Bachman 
Executive Vice President: Emerson W. Pugh 
Executive Director: Eric Herz 



















BE PRESENT 
AT THE DAWN 
OF CREATION. 

The computer scientists and engineers at Lockheed Missiles 
and Space Company are now entering a bold new era in 
computer technology. An era full of opportunity, as we explore 
the full potential of artificial intelligence. To achieve that goal, 
we are establishing a new Artificial Intelligence Research 
Center in Palo Alto, California. 

This Center will be dedicated to pure research into areas 
such as image and speech understanding, natural language 
understanding, knowledge-based and model-based systems. 
The Center will also support training, consulting and hardware/ 
software tool evaluation programs for the entire company. 

We're building a team of skilled professionals to meet the 
challenges ahead. This will truly be a ground floor opportunity. 
So if you're interested in defining tomorrow's technologies and 
are degreed in computer science, or an AI-related field, we
invite you to explore this unique opportunity. 

Lockheed Missiles & Space Company 

Innovation 


RESEARCH 

Evidential Reasoning 
Reasoning with Uncertainty 
Reasoning in Context 
Machine Learning 
Knowledge Acquisition 
and Representation 
Data Fusion 

Human-Machine Interface 
APPLICATIONS 

Maintenance and Diagnostics 
Satellite Autonomy 
Situation Assessment 
Data Interpretation 

Attention AAAI attendees. Stop by
our suite at the Windham Franklin
Plaza Hotel, 16th and Race. Details at
the literature table near registration.

For immediate consideration, forward
your resume to Professional
Staffing, Dept. 530FE22, Lockheed
Missiles & Space Company, P.O. Box
3504, Sunnyvale, CA 94088-3504. We
are an equal opportunity, affirmative
action employer. U.S. citizenship
is required.









VAXstation II/GPX:
The end of isolation
for CAE and CASE
applications.

With Digital’s VAXstation II/GPX™ color workstation, Computer Aided
Engineering and Computer Aided Software Engineering projects can now be
tied into the computing resources of the entire company.

VAXstation II/GPX’s built-in networking capabilities provide access to Digital’s
large VAX™ systems via Ethernet and DECnet™ networking software. And
to other vendors’ systems via gateways and other communications protocols.
By off-loading compute-intensive tasks to larger computers, the MicroVAX II™
CPU and GPX graphics coprocessor can concentrate on delivering exceptional
graphics at tremendous speeds. Plus, you can monitor all aspects of a project
on the screen’s multiple windows.

VAXstation II/GPX runs popular CAE applications from such companies as
Scientific Calculations, Inc., Silvar-Lisco™, Tektronix™ - CAE Systems Division
and VLSI Technology, Inc. It also runs CASE applications from B.S.O., Interactive
Development Environments, Nastec Corp. and Tektronix - SDP Division.

VAXstation II/GPX. The workstation comes out of isolation and into the
mainstream. For brochures, write: Digital Equipment
Corporation, Media Response Manager, 200 Baker Ave.,
West Concord, MA 01742. Or call your local sales office.


© Digital Equipment Corporation 1986. Digital, the Digital logo, DECnet, MicroVAX II, VAX and VAXstation II/GPX are trademarks of Digital Equipment Corporation.
Silvar-Lisco is a trademark of Silvar-Lisco. Tektronix is a trademark of Tektronix, Inc.
Explore the Knowledge-based Society
During the Professional Conference 


Really nine special conferences in one. Just to highlight one of the nine tracks covered at FJCC ’86- 




• Education 

• Software Systems 

• Artificial Intelligence 

• Supercomputing 

• Algorithms 

• Modeling/Measurement 

• Computer Design 

• International Developments 

• Operating Systems/Data Bases/LANs




FJCC ’86 Presents Software Systems 




The three most critical ingredients of a successful computer system 
are software — and software — and software. How do we make it? 
How do we test it? John White’s sessions on integrated environments 
present the current thinking on high-productivity tools. The Hypertext 
session shows Andries van Dam’s influence on next-generation systems
for handling documents. 

Engineering of these systems is no easy matter. Gerald Weinberg's 
talk on technology transfer will take you through how and how not 
to advance your own efforts in software engineering. If you succeed 
in producing a new gem of software, how do you know that it is reliable? 
Bill Howden’s session on software testing shows that advanced testing
techniques have moved from the research lab to the development 
lab. His strategy can become your strategy at the FJCC. 


Technical Forum Presentations

“This is an exciting period in the
evolution of computer architectures. 
Symbolics, as the leader in symbolic 
processing architectures, wouldn’t miss 
being at FJCC ’86. Our Technical Forum 
presentation at FJCC ’86 will give us a 
unique chance to reach key decision 
makers. This is one show we really have 
to be a part of.” 

Joseph Morin, Director 
Federal Systems Group 
Symbolics, Inc. 














Join Us At The 

FALL JOINT COMPUTER CONFERENCE ’86 
INFOMART— Dallas, Texas 


The Forum for ACM’s and IEEE Computer 
Society’s Annual Conferences in 1986. 




Dr. Harold S. Stone 

Program Chairman 



Toni Shetler 

Chair, Professional 
Education Program 


As computer professionals, we are responsible for 
the increasingly complex information processing systems 
that our society depends on. To do our jobs properly, we 
have to pursue quality information and educational opportunities.
And invariably we have had to sacrifice either quality
or quantity in choosing which seminars or conferences to attend. 

For once, however, you don’t have to make that choice. At FJCC 
you will have an opportunity to explore, in depth, NINE major arenas,
and every one of these is worth attending. This has warranted the opinion 
from leading industry figures that this conference will be the biggest 
and best in the last ten years! We’ve also carefully scheduled the sessions 
so that you can either obtain a total immersion experience in just one 
or two subject areas; or gain a solid awareness for current developments 
in all of the nine vital arenas. 

Don’t miss this major event, and I’ll see you there! 


A serious concern for computing professionals and corporations that 
use and create this technology is retaining competence and current 
information in an environment where pressures driven by change, 
competition and cost are unprecedented. A critical aspect of solving 
this concern is providing solid educational opportunities for the 
computing professional that fit into a busy schedule. 

The Professional Education Program at FJCC offers in-depth training 
as an integral part of the conference program. PEP course offerings 
in AI, Software Engineering, Hands-On Computing . . . taught by some
of the most distinguished lecturers/experts in the field today, provide 
an opportunity that computing professionals and their organizations 
cannot afford to miss. 


For further conference information, return the coupon or contact:


Program 
Dr. Harold S. Stone, 

IBM T. J. Watson Research Center
P.O.Box 218 

Yorktown Heights, NY 10598 


Professional Development 
Toni Shetler 

TRW Systems Division Wl/4454 
7600 Colshire Drive 
McLean, VA 22102 



r further information. NOTE—early registration by October 















11th Conference On 
Local Computer Networks 

October 6-8, 1986 Minneapolis, Minnesota

You are invited to attend 

Tutorials - Paper Sessions - Panel Sessions - Discussions of Practical Experience 


A Conference Specializing In 

- local networks- 

Program Includes: 

Monday, October 6, 1986 

7:30 a.m.-9:00 a.m. Registration 

9:00 a.m.-5:00 p.m. Tutorial 1 

"LANs and the MAP/TOP Factory and Office Automation 

Protocols," Maris Graube, Consultant 

9:00 a.m.-5:00 p.m. Tutorial 2 

"Local Computer Networks: Software Applications," Noel 

Schmidt, Architecture Technology Corporation 

Tuesday, October 7, 1986 

8:00 a.m.-9:30 a.m. Registration 

9:30 a.m.-11:00 a.m. Keynote Session

"Evolution of Broadband LANs," J. Edward Snyder - General 

Manager, TRW Information Networks Division 

"Ethernet - From Technology To Solutions," Judith L. Estrin - 

Executive Vice President, Bridge Communications, Inc. 

11:15 a.m.-12:15 p.m. Session 1 - LAN Issues
"Resource Sharing Local Networks," R. Linebarger & D. Doggett,
Brigham Young University 
"Jitter," Howard Salwen, Proteon 

12:15 p.m.-1:30 p.m. Lunch
1:30 p.m.-3:00 p.m. Session 2 - LAN Testing 
Chairman: Harvey Freeman - Architecture Technology 
Panelists: Jim Amaral - Cabletron 
Steve Gibson - CMC 
Linda Stewart - Excelan 
Howard Salwen - Proteon 
Stephane Johnson - TTIN 

3:30 p.m.-5:00 p.m. Session 3 - MAP 
"Performance Management," K. H. Muralidhar, Industrial 
Technology Institute 

"High Performance Bus For MAP," Doug Jacobson, Iowa 
State University 

"Topological Aspects of MAP Network Design," J. R. 

Pimentel & G. F. Campbell, General Motors Corp. 

5:30 p.m.-6:30 p.m. Cocktail Party 

6:30 p.m.-9:00 p.m. Banquet Feature Comedy 

Presentation: Michael Jacobs 

Wednesday, October 8, 1986 

8:30 a.m.-10:00 a.m. Session 4a - Broadband 

"Design of a Broadband Network," Kanti Prasad, University of 

Lowell 


"Broadband Ethernet For Image Transmission," S. E. Hauser, 
M. A. Crocker, & T. R. Harris, National Institute of Health 
"The Future of Broadband," Philip Edholm, Sytek 

8:30 a.m.-10:00 a.m. Session 4b - LAN Software 

"Packet Capture System for LAN Software Development," N. 

Michael Minnich, University of Delaware

"A Software Tool for Performance Analysis of LANs," Dennis 

S. Mok, Western Illinois University 

"Eland, An Expert System for LAN Applications 

Configuration," S. Ceri & L. Tanca, Politecnico di Milano 

10:30 a.m.-12:15 p.m. Session 5a - Token Rings 

"Higher Level Protocols on The Token-Ring Network," J. J. 

Carlo & G. R. Samsen, Texas Instruments 

"Performance of the TMS 380 Token Ring LAN Adapter," 

Cedell Alexander, Texas Instruments

"LANs with Multiple Host Connections," D. F. Kornblum,

Bell Communications Research 

10:30 a.m.-12:15 p.m. Session 5b - Network 

Management 

"Living on The Leading Edge of Ethernet Use," Michael K. 
Molloy, University of Texas 

"A Laboratory for LAN Measurement Experimentation," 

R. Kinicki, P. K. Chiang & C. Walton, Polytechnic Institute 
"Network Monitoring," D. C. Feldmeier, MIT Lab For 
Computer Science 

"Network Management for Large HYPERchannel LANs," 

Ken Hardwick, Network Systems 
12:15 p.m.-1:30 p.m. Lunch
1:30 p.m.-3:00 p.m. Session 6 - Software Issues 
Chairman: Mike McKee - Consultant 
Panelists: J. Scott Haugdahl - Architecture Technology 
Rod Merry - Computer Network Technology 
Chung Le - HDR 
Joel Halpern - Network Systems
Bill Johnson - ViaNetix 

3:30 p.m.-5:00 p.m. Session 7 - Future Directions 
Chairman: Ron Rutledge - Transportation Systems Center 
Panelists: To be announced 


Sponsored by: 


IEEE Computer Society TC-Computer Communications 











PRE-REGISTER FOR THE 
11th Conference on 
Local Computer Networks 

Conference Site - Minneapolis Plaza Hotel 
315 Nicollet Mall, Minneapolis, MN 55401 


Tutorials: 

"LANs and the MAP/TOP Factory and Office 
Automation Protocols," Maris Graube, Consultant
"Local Computer Networks: Software Applications,"
Noel Schmidt, Architecture Technology
Corporation

Keynote Speakers: 

"Evolution of Broadband LANs," J. Edward 
Snyder - General Manager, TRW Information 
Networks Division 

"Ethernet - From Technology to Solutions," Judith 
L. Estrin - Executive Vice President, Bridge 
Communications, Inc. 

Banquet Feature Comedy Presentation: 

Michael Jacobs 

General Chairman: 

Bob Lutnicki 

Computer Network Technology 
9440 Science Center Drive 
New Hope, MN 55428 
(612) 535-8111 



Registration: 

Send registration and fee to: 
Guest Editor's Introduction


Domesticating 
Parallelism 


David Gelernter 
Yale University 


Parallel computers 
are becoming 
increasingly 
widespread. The
question now is what 
to do with them. Can 
programming- 
language researchers 
make the power of 
parallelism accessible 
to programmers? 
If many computers work simultaneously on one problem, they should be
capable of solving it faster than a single computer working alone. It
should therefore be possible to build a powerful machine by connecting
many subcomputers together into a single parallel computer. This much
has been common wisdom since the sixties. It is less well known that
parallel computing, long a notorious hangout for Utopians, theorists,
and backyard tinkerers, has almost arrived and is definitely for sale.
Alliant, ELXSI, Encore, BBN, Intel, NCUBE, Sequent, and Thinking
Machines, among others, market parallel computers. There will be
more soon. A large number of experimental projects continue to thrive
as well, among them Caltech’s Cosmic Cube, IBM’s RP3, and AT&T Bell
Labs’ S/Net. It seems clear that, within the foreseeable future,
parallel machines will be available to most serious computer users.

Once you have unpacked your parallel 
computer and plugged it in, what do you 
do with it? Parallelism is only useful, of 
course, in solving problems that may be 
attacked by many cooperating agents 
working simultaneously. There seem to be 
many such problems; the dominant group 
comes from numerical analysis and 
computational physics, and others involve 
graphics, image processing, systems, and 
artificial intelligence. It is therefore fairly 
simple to find a good problem. Having 
done so, step two is to write a program that 
solves the problem and runs on your parallel
machine. Here is where the difficulties
start.

How is a parallel machine to be programmed?
There are two main possibilities.
We can provide either a parallelizing
compiler or a parallel programming 
language. 

In the first case, programmers write 
conventional, non-parallel programs and 
leave it up to a smart compiler and runtime 
system to parallelize them. This solution is 
attractive because it absolves users from 
thinking about parallelism and, of course, 
it allows them to run their old programs on 
new parallel machines without rewriting. 
A good deal of progress has been made in 
this direction, and it is and will continue to 
be an important one. Here, however, we 
are exclusively concerned with the other 
possibility. Equipped with a parallel 
programming language, programmers 
formulate algorithms that are explicitly
parallel, imagined in terms of many
cooperating agents instead of a single one.
The articles in this issue all deal with the 
design and implementation of languages 
of this sort. 


Parallel programming 
languages versus 
parallelizing compilers 

Why should we force programmers to worry about parallelism if they
don’t really have to? If they might plausibly continue to program in
the usual way and leave parallelism to the compiler? Researchers often
argue that, while a parallelizing compiler can transform an existing
algorithm, it can’t rewrite the algorithm. Programmers freed to think
in parallel sometimes invent entirely new ways of solving problems.
Consider, as a simple example, the parallel alpha-beta algorithm
described by Finkel and Fishburn [1]. The original alpha-beta algorithm
is strongly sequential: it searches a tree of game positions, using
information developed earlier in the search to cut off later
subsearches that can’t possibly improve on already-discovered sequences
of moves. The parallel version is related but different. Starting
with a given game state, it explores all next-states in parallel; any
subcomputation may be cut off as a result of any other. The original
was asymmetric, but the parallel version is symmetric. Rethinking of
this sort is an example of algorithm design, not of simple
parallelizing transformation.
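
The sequential dependency described above can be seen in a minimal Python sketch of the original algorithm. The tree encoding (nested lists with numeric leaves), the function name, and the `visited` list are illustrative assumptions, not taken from Finkel and Fishburn, and the parallel variant is not shown.

```python
import math


def alpha_beta(node, alpha=-math.inf, beta=math.inf, maximizing=True, visited=None):
    """Sequential alpha-beta search over a toy game tree.

    A node is either a number (a leaf score) or a list of child nodes.
    `visited` (an assumption added for illustration) records the leaves
    that were actually evaluated, so the cutoffs can be observed.
    """
    if visited is None:
        visited = []
    if not isinstance(node, list):
        visited.append(node)
        return node

    best = -math.inf if maximizing else math.inf
    for child in node:
        # Each child is searched with bounds tightened by its earlier
        # siblings: this is the left-to-right dependency that makes the
        # original algorithm sequential and asymmetric.
        value = alpha_beta(child, alpha, beta, not maximizing, visited)
        if maximizing:
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        if alpha >= beta:
            break  # later siblings cannot improve the result
    return best


if __name__ == "__main__":
    seen = []
    print(alpha_beta([[3, 5], [2, 9], [0, 1]], visited=seen), seen)
```

In this toy tree the leaves 9 and 1 are never evaluated, because the first subtree has already established a bound that the later subtrees cannot beat. A version that explores all next-states at once cannot rely on that ordering, which is why it amounts to a redesign and not a mechanical transformation.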

Parallel languages have other points in 
their favor: Significant program classes 
exist whose members seem too complex or 
irregular ever to be optimally parallelized 
by compilers. But parallelizing compilers 
have further advantages as well. Both 
alternatives will be actively pursued for 
some time to come. 

Parallel programming sounds like a good idea, then, except that it’s
impossible, or at least too hard to be worth the effort; so says a
large contingent of worried skeptics. They argue that the complexity
of explicitly-parallel programming (all those processes active at
once, all those bits zinging around in every direction) is simply too
great for the average programmer to bear. Here again there are
serious arguments pro and con, discussed in several of the articles
following. It is worth noting, though, that researchers with parallel
programming experience are in many cases the least fretful. Chuck
Seitz of Caltech leads the group that built the Cosmic Cube
multicomputer. According to Seitz,

Programming experimental concurrent computers like the Cosmic Cube is
not much harder than programming sequential computers, if the problem
lends itself to a concurrent solution and the system software and
programming tools are no worse than what we are used to on sequential
computers [2].


Those of us who have worked on the Linda project (discussed in “Linda
and Friends”) have noticed the same thing: Parallel programming (in
our limited and preliminary experience) doesn’t seem to be harder in
any fundamental way than conventional programming. Parallel
programming may be too hot to handle; its proponents may be just about
to fall off the edge of the earth, even if they haven’t yet. But for
now, most researchers seem willing to soldier on.

Considerations like these have given rise to a research effort whose
goal is to domesticate parallelism by providing languages in which it
is convenient to write explicitly-parallel programs. Designers of
parallel languages hope to turn parallelism into an ordinary household
programming technique, ready and waiting whenever compute-intensive
applications call for it.

What kind of parallel 
language? 

The articles that appear in this issue are by no means an exhaustive
survey, but they do span the spectrum. It’s useful to classify the
systems they discuss into parallel Algol-based languages (Ada and
Linda), parallel Lisps and logic languages (Multilisp and Concurrent
Prolog), and parallel functional languages (Paralfl). Note that, in
the schismatic, ideology-ridden world of programming language
research, these alternatives are rarely considered side-by-side. The
underlying thesis of this special issue is that they must be. No
fair-minded observer can ignore any of them.

Consider the parallel Algol-based languages first. One of the first
entrants was a fascinating and influential fragment that has become
known as CSP [3]; Occam [4] is a complete language that derives from
Hoare’s proposal. CSP’s simplicity and elegance have been widely
admired (though seldom imitated, unfortunately). It has been argued on
the other hand that the language is low-level in character (all
message buffering, for example, must be implemented by the user) yet
not particularly flexible. It is virtually impossible, for example, to
implement a fully-asynchronous send in CSP, in other words a send
operation that returns to the sending process immediately without
awaiting reception or acknowledgement by a receiving process. CSP
nonetheless continues to be studied widely and to be highly
influential. It has been taken up not only as a programming language
but as a clean notation for parallel algorithms and a good vehicle for
analyzing parallel processing.

Ada and Linda, the two Algol-based 
languages discussed in the first and second 
articles in this issue, are each very different 
from CSP. Ada was designed originally 
not for parallel applications but for 
embedded systems. Such systems, how¬ 
ever, are often made up of many concur¬ 
rent, communicating processes multi¬ 
plexed on a single processor. Ada provides 
the concurrent processes and the com¬ 
munication tools needed to support such 
structures and is therefore a natural can¬ 
didate for parallel programming as well. 
Since Ada is the official programming 
language of the U.S. Department of 
Defense, its place in history is secure. It’s 
interesting to note, nonetheless, that the 
language was originally greeted with con¬ 
siderable enthusiasm in academic circles 
and has since been almost completely 
abandoned, to the point where it is occa¬ 
sionally denounced by critics who don’t 
seem to know anything about it. This 
widely-noticed phenomenon is disturbing 
and worth pondering. The wheel will no 
doubt turn again. Linda for its part is not 
even the official language of the Liechten¬ 
stein Department of Defense, but it has at¬ 
tracted its share of attention nonetheless. 
Its philosophy of providing a small num¬ 
ber of simple but very-high-level com¬ 
munication operators, and then daring the 
implementors to support them efficiently, 
sets it sharply apart both from CSP and 
from Ada. 

Note that it’s sometimes argued that 
Linda-style communication can be simu¬ 
lated in CSP, or CSP-style communica¬ 
tion in Ada, or Ada-style communication 
in Linda and so forth, the intention being 
to demonstrate that CSP (or Ada or Lin¬ 
da) is somehow better than the others. 
Such contentions are largely meaningless. 
The goal of programming language re¬ 
search is to devise good abstractions. Once 
a programmer has decided that he prefers 
CSP’s abstractions to Ada’s, what he 
needs is an efficient CSP implementation. 
The fact that he can simulate CSP in Ada 
is of little importance to him. All it proves 
is that one way of implementing CSP is to 
use Ada—and if you want CSP, there are 
likely to be much more efficient ways to 
get it. 


Parallel Lisps and 
Concurrent Prolog 

Most artificial intelligence programs are 
written in Lisp, and the Lisp family con¬ 
tinues to grow in popularity for applica¬ 
tions of all sorts. AI in particular seems 
like a logical domain for parallel program¬ 
ming. Many AI programs rely either on 
heuristic search through huge solution 
spaces, or on the repeated examination of 
large numbers of knowledge-base records. 
Both sorts of activity often seem to be 
parallelizable. It’s interesting to note that, 
in parallel AI research to date, there’s been 
more interest in special-purpose parallel 
architectures than in parallel programming 
techniques. But as general-purpose 
parallel machines become available, it 
seems more than likely that AI researchers 
will want to run programs on them, for the 
same reasons they seek out fast uniproces¬ 
sors today. 

In developing parallel AI applications 
(and arguably within many other domains 
as well), some form of parallel Lisp seems 
like a natural choice. Bert Halstead’s 
Multilisp is one such language; others have 
been proposed as well. Multilisp, how¬ 
ever, was one of the first parallel Lisps and 
is now one of the best developed. 
Halstead’s article describes the language 
and serves as a good general introduction 
to parallel programming besides. His 
“Parallel programming in the large” section 
raises issues that are among the most 
important and least sufficiently-discussed 
in parallel programming. 

As languages go, Prolog and Lisp don’t 
have very much to do with each other. 
Prolog researchers and Lisp researchers 
often address the same audience, though, 
and the natural application domains of 
these two languages overlap to some ex¬ 
tent. Prolog is one possible source lan¬ 
guage for a parallelizing interpreter or 
compiler, but it can also serve as a basis for 
a language that allows explicitly-parallel 
programming. Shapiro describes one such 
language. His work on Concurrent Prolog 
is noteworthy as well in ways that go 
beyond the logic programming and Prolog 
context. Most parallel languages support 
parallelism with pieces bolted onto a con¬ 
ventional framework. It’s arguably more 
satisfying to design a parallel language by 
revealing the latent parallelism in a lan¬ 
guage model’s foundations; this is what 
Shapiro has done, and his work is remark¬ 
able for its elegance and its conceptual 
economy. 

Functional languages 

Functional programming tops the 
charts as the uncontested hottest topic in 
programming languages today. Func¬ 
tional languages are probably less widely 
understood than they are admired, 
though; papers on the topic are often hard 
for nonspecialists to read, given the 
customary mathematical formalism. Paul 
Hudak’s discussion of a functional lan¬ 
guage for explicitly parallel programming 
is a lucid and readable introduction to the 
topic. 

Functional languages are simply one 
kind of programming language. They 
ought to be (and are entitled to be) judged 
by the same standards as all the rest. They 
emerge when we eliminate the idea of a 
program state that may be modified and 
manipulated by assignment statements, 
and redefine programming as exclusively 
the definition and evaluation of expres¬ 
sions. Proponents believe that the simple, 
elegant languages that result conduce to 
more orderly, more rigorous, more verifi¬ 
able, and ultimately more efficient pro¬ 
gramming. Opponents worry about losing 
expressivity as a result of the expression- 
evaluation-only model, and are troubled 
on sleepless nights by a variety of heretical 
thoughts. “You can’t write an AI program 
in a functional language,” says Drew 
McDermott of Yale, 5 coauthor of the 
well-known AI programming text. Sys¬ 
tems programmers are easily upset by a 
model that seems to collide head-on with 
basics like clocks, semaphores, and physi¬ 
cal devices under program control. Propo¬ 
nents describe functional languages as 
natural tools for scientific programming. 
But Keshav Pingali of MIT remarks that 
Professor Arvind’s group conducted an 
experiment in which they translated a two- 
thousand line numerical Fortran code into 
a functional language. The result was 
three thousand lines long, and the experi¬ 
ment convinced them that the complex¬ 
ities of dealing with large data structures 
made functional languages unworkable 
for this kind of application. 6 They are 
now programming in a new form of lan¬ 
guage proposed by Pingali in which func¬ 
tional characteristics are augmented with 
logic-language features. 

Many other groups, of course, continue 
to experiment with functional program¬ 
ming. It’s to be hoped that an ever- 
increasing proportion of the computing 
community will read their papers carefully 
and benefit from their new ideas. After all, 
the designers of functional languages are 
estimable researchers. (“Their proponents 
are often brilliant intellectuals” notes 
James Morris, who is one of them. 7 ) But 
there is nothing magic, sinister, or incom¬ 
prehensibly deep in these languages, and 
readers are urged to judge them in the 
same way they judge other language 
proposals. 

What next? 

All these various languages are fine for 
the present, observers have been known to 
remark, but what will happen when we 
have really big machines, parallel ma¬ 
chines with tens of millions of nodes and 
beyond? How can we possibly learn any¬ 
thing about these hyperparallel computers 
from the present generation of puny 
specimens? Won’t computing have to 
change in thoroughgoing, fundamental 
ways if we are ever to manage these huge 
machines? For that matter, won’t pro¬ 
gramming as we know it inevitably disap¬ 
pear in the process? 

It’s impossible to doubt that computing 
will change radically. But it’s almost as 
difficult to believe that the new forms that 
emerge will be completely unrelated to our 
current practices. Some hyperparallel 
machines will be assembled out of simple, 
tightly-coupled elements. Others, how¬ 
ever, will no doubt be generalizations of 
current designs in which each computing 
element is a powerful, self-sufficient unit. 
(Such a machine’s computing nodes 
should differ from a current machine’s 
mainly in providing strong support for 
some kind of higher-level parallel programming 
model.) Users of this latter kind 
of machine will presumably be able to 
count on help from sophisticated pro¬ 
gram-generating tools. Source programs 
may ultimately look more like concise, 
formal descriptions than most programs 
do now; such formal descriptions might 
conceivably result from informal ex¬ 
changes in English between the user and a 
description-synthesizer program. But 
however rarefied our source programs 
become, some entity will have to under¬ 
stand how to transform them into assem¬ 
blages of real, simultaneously-executing 
processes. 

Parallel language research is developing 
this understanding right now. And those 
who wait expectantly for the demise of 
programming as we know it are advised, 
finally, to consider the goal of the 1957 
designers of Fortran: the elimination of 
programming. Fortran was an automatic 
coding system, designed to allow pro¬ 
grams to be replaced by quasi-mathemati- 
cal formulas. 8 As the designers under¬ 
stood it, their goal was largely achieved. Of 
course, the same problems of designing, 
coding, debugging, and testing familiar in 
assembly programming simply reemerged 
at a new and higher level. Advocates of 
ultra-high-level “non-programming” are 
advised to expect the same. 

For now, parallel-language researchers 
are using their systems as laboratories, 
and the results are changing 
our understanding of programming lan¬ 
guages and of the process of computation. 
They are seeking, at the same time, to 
bring about a qualitative change in the role 
of computers by enormously increasing 
the speed at which programs execute. 
These projects should keep them off the 
streets for some time to come. □ 

Domesticating Parallelism 


Parallel Processing 
in Ada 



Ada's tasking 
mechanism 
represents a bold 
attempt to allow 
large, complex 
parallel processing 
applications to be 
written in a portable 
high-level language. 


Ada was designed from the beginning 
with parallel processing applications 
in mind. Its tasking 
mechanism is a coherent response to the 
language issues involved in parallel pro¬ 
cessing, and carefully balances the often 
conflicting goals of high-level language 
features on the one hand and efficient im¬ 
plementation on the other. The purpose of 
this discussion is to place the design of 
Ada’s parallel processing in its proper 
historical and technical context. In the 
process we will show how Ada itself has 
clarified some issues and thus established 
trends in language design. 

The real-time 
environment 

In the mid 1970’s, when the Common 
Language Project first began, more than 
five hundred languages were in use for De¬ 
fense Department programming, typically 
in large programs containing a hundred 
thousand lines or more of source code. 
They were not only large, but also in¬ 
herently complex, demanding real-time 
programs with stringent timing and relia¬ 
bility requirements. The difficulties inher¬ 
ent in such programs were compounded 
by the inadequacies of the languages used. 
The great number of languages was itself a 
problem, since it guaranteed that code 
could not be reused and that every new ap¬ 
plication had to start from scratch. Fur¬ 
thermore, although many of them quali¬ 
fied as high-order languages, or HOLs, 
almost none of them had provisions for 
handling real-time or parallel processing, 
and requirements for efficiency often pre¬ 
cluded the use of HOLs for critical real¬ 
time aspects of computation. Consequent¬ 
ly, programs were routinely littered with 
machine code insertions to do the real¬ 
time processing, largely vitiating the bene¬ 
fits of using HOLs. Sixty to ninety percent 
of the source lines were assembly language 
insertions, even in HOL programs. Final¬ 
ly, the programs in this environment were 
long-lived and continuously changing, 
with resultant increases in the difficulty of 
maintenance. 

The results were predictable: software 
costs escalated and the reliability of the re¬ 
sultant systems suffered. Ada’s mission, 
then, was to provide the general structures 
and language features relevant to the de¬ 
sign and implementation of large, reliable, 
real-time programs of inherent complexity. 

However, Ada was also intended for 
time-critical applications where the com¬ 
puting resources may be limited by space, 
weight, power, or other physical con¬ 
straints. Avionics computers, for exam¬ 
ple, are limited because they must be on 
board the aircraft. Thus Ada’s design was 
influenced by the need for runtime effi¬ 
ciency. This led to a philosophy of compil¬ 
ation that provides as much generality and 
flexibility as possible at compile time, in 
combination with a relatively static run¬ 
time model. The more dynamic features, 
of necessity more expensive at runtime, 
are isolated so that their use cannot be ac¬ 
cidental or unavoidable. The language 
clearly sets apart, for example, unconstrained 
objects whose size can only be determined 
at runtime. 

Ada also requires more information 
about the programmer’s intent than do 
traditional languages. This allows the 
compiler to do a better job of optimizing 
its output. The private parts of packages, 
for example, supply information to the 
compiler about the size of hidden data 
types so that better code can be generated. 
Another example is Ada’s strong typing, 
which lets the compiler optimize data rep¬ 
resentations as a function of the limited 
ways in which those representations will be 
used. Strong typing means not only that 
the user establishes conventions for the use 
of a given type, but also that the compiler 
enforces those conventions. As a last 
resort, Ada even allows the extreme form 
of runtime optimization: the dreaded ma¬ 
chine code insertion, which although not 
recommended does give the compiler an 
unequivocal statement of the program¬ 
mer’s intentions regarding code generation. 

Finally, the concern for runtime effi¬ 
ciency led Ada’s designers to impose 
restrictions on certain language features to 
ensure that they could be implemented ef¬ 
ficiently. We will discuss some of these 
restrictions later. 

Ada's tasking model 

Macro-parallelism versus micro-paral¬ 
lelism. The parallel processing model that 
Ada introduced to meet its conflicting 
goals is based on the task. By model we 
mean the relation between the program 
structure and the multiple, possibly vir¬ 
tual, processors on which the program will 
be executed. One important attribute of a 
model is the language level at which it in¬ 
troduces parallelism. Some models at¬ 
tempt to introduce parallelism at the ex¬ 
pression level, and have the compiler 
attempt to distribute the evaluation of ex¬ 
pressions onto multiple processors when it 
can be shown that their several terms are 
independent. Another class of models 
seeks out parallelism at the statement level 
and uses flow analysis to determine which 
statements within a procedure can be exe¬ 
cuted in parallel. Finally, some models in¬ 
troduce concurrent processing at the sub¬ 
routine level and assume that any given 
subroutine (or process) will be executed sequentially 
on just one processor, but that 
multiple processes may be executed in 
parallel. 

Ada views expression models and state¬ 
ment models as more relevant for the im¬ 
plementation of optimizing compilers 
than for language design. The micro-level 
concurrency they provide may be appro¬ 
priate in such applications as signal pro¬ 
cessing or when targeting special-purpose 
computer hardware such as systolic ar¬ 
rays. It is the macro-level concurrency 
provided by process-level models, how¬ 
ever, that is fundamental to the sorts of en¬ 
vironments for which Ada was designed. 
These are large-scale real-time applica¬ 
tions involving concurrent monitoring of a 
variety of sensors and concurrent control 
of a variety of actuators. Such applica¬ 
tions include avionics software to control 
aircraft in flight, communications soft¬ 
ware of the type that might be found in a 
satellite network, military command and 
control applications, and industrial pro¬ 
cess control software such as that used to 
control a refinery. The process model is es¬ 
sential to such applications, so Ada leaves 
fine-grained parallelism to the compiler 
writers while building macro-level concur¬ 
rency into the language on the ground 
floor. 

Task synchronization. Perhaps the 
largest single issue facing languages that 
adopt a process model is the question of 
how multiple processes (or tasks) syn¬ 
chronize with one another. Such synchro¬ 
nization is necessary for two reasons. 
First, parallel processing languages must 
provide a mechanism for mutual exclusion 
so that multiple processes are prevented 
from accessing a common system resource 
simultaneously. Second, processes must 
synchronize with each other when they ex¬ 
change data; the sending process must not 
begin transmitting until the receiver is 
ready. 

A wide variety of mechanisms have 
been proposed to ensure such process syn¬ 
chronization. We may distinguish between 
low-level and high-level primitives. Low- 
level primitives such as semaphores and 
signals are generally unstructured; they of¬ 
fer efficiency and relative ease of imple¬ 
mentation, and are powerful enough to 
implement any desired parallel processing 
algorithm. On the other hand, they are 
dangerous because they can lead to programs 
that are difficult to develop, debug, 
and maintain. It is, for example, all too 
easy to program a wait on a semaphore but 
to program the corresponding signal in 
such a way that it is sometimes not exe¬ 
cuted, so that the system hangs. High-level 
or structured primitives such as coroutines 
and monitors are easier to program, but 
do not always provide the needed func¬ 
tionality and may lead to more expensive 
implementations. 

Because Ada was designed for very large 
applications that may last decades and 
that are continually undergoing upgrades, 
software reliability and the ability to main¬ 
tain that reliability in the face of constant 
program change throughout the mainte¬ 
nance period were major considerations in 
the Ada design. These considerations re¬ 
quired that the parallel processing features 
of Ada be high-level language features 
consistent with modern high-level lan¬ 
guage design principles, including, for 
example, structured primitives for task 
synchronization and multiway waiting 
(where a process is blocked while waiting 
to provide or to be provided with one of 
several services), since these features are 
required to support strong typing, data ab¬ 
straction, and encapsulation. 
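
As a minimal sketch of multiway waiting (it is not taken from the original article), the program below uses a select statement with guards so that a one-slot buffer task waits for whichever of its two services can currently be provided. The names Select_Demo, Buffer, Producer, Put, and Get are illustrative assumptions, and the terminate alternative is there only so that the example shuts down cleanly.

```ada
with Text_IO; use Text_IO;

procedure Select_Demo is

   --  Hypothetical one-slot buffer offering two services.
   task Buffer is
      entry Put (Item : in Integer);
      entry Get (Item : out Integer);
   end Buffer;

   task Producer;

   Received : Integer;

   task body Buffer is
      Slot : Integer := 0;
      Full : Boolean := False;
   begin
      loop
         select
            when not Full =>          --  guard: accept Put only when empty
               accept Put (Item : in Integer) do
                  Slot := Item;
               end Put;
               Full := True;
         or
            when Full =>              --  guard: accept Get only when full
               accept Get (Item : out Integer) do
                  Item := Slot;
               end Get;
               Full := False;
         or
            terminate;                --  lets the program end cleanly
         end select;
      end loop;
   end Buffer;

   task body Producer is
   begin
      for I in 1 .. 3 loop
         Buffer.Put (I * 10);
      end loop;
   end Producer;

begin
   for I in 1 .. 3 loop
      Buffer.Get (Received);
      Put_Line ("got" & Integer'Image (Received));
   end loop;
end Select_Demo;
```

Because the buffer holds a single item, the consumer should see the values in the order in which they were produced.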

Figure 1. The usual implementation of Ada rendezvous. The calling task is suspended 
in a queue for the entry it is calling. 

The requirement for efficient code in 
critical real-time applications militated 
against using a scheme whose underlying 
implementation model was based on 
message-passing, since timing in message-passing 
systems is generally unpredictable 
due to their dynamically varying storage 
requirements. Instead, Ada’s rendezvous 
mechanism uses an implicit queuing 
scheme that can be implemented using 
semaphores. When a task needs to synchronize 
with another, it requests a rendezvous. 
The points at which rendezvous 
can take place are specified by entries 
declared in the visible parts of tasks. An 
entry is like a procedure call and provides a 
service that one of the partners in the ren¬ 
dezvous requests by calling the entry and 
the other provides by executing an accept 
statement. If a partner is not available, the 
task requesting rendezvous is suspended 
until a partner becomes available. The 
Language Reference Manual 1 (Section 
9.5.15) states that this will be done by pro¬ 
viding one queue per entry, so that callers 
can be placed in the queue for that entry 
until a rendezvous is possible (see Figure 
1). As is the case in systems based on ex¬ 
plicit semaphores, this approach keeps the 
number of primitives small by using the 
same mechanism for both mutual exclu¬ 
sion and task synchronization: tasks in a 
rendezvous both exchange parameters and 
execute a critical region. 
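
The dual role of the rendezvous can be sketched with a small, self-contained program. It is an illustration under stated assumptions rather than code from the article: the names Rendezvous_Demo, Counter, Client, Increment, and Read are hypothetical. Each call to Increment executes as a critical region inside the accepting task, and Read passes a parameter back to the caller.

```ada
with Text_IO; use Text_IO;

procedure Rendezvous_Demo is

   --  Hypothetical server task that owns the shared count.
   task Counter is
      entry Increment;
      entry Read (Value : out Natural);
   end Counter;

   Total : Natural;

   task body Counter is
      Count : Natural := 0;
   begin
      loop
         select
            accept Increment do
               Count := Count + 1;   --  critical region: one caller at a time
            end Increment;
         or
            accept Read (Value : out Natural) do
               Value := Count;       --  parameter exchange during rendezvous
            end Read;
         or
            terminate;
         end select;
      end loop;
   end Counter;

begin
   declare
      task type Client;
      Clients : array (1 .. 4) of Client;

      task body Client is
      begin
         for I in 1 .. 100 loop
            Counter.Increment;       --  queued on the entry until accepted
         end loop;
      end Client;
   begin
      null;  --  leaving this block waits for all four clients to finish
   end;

   Counter.Read (Total);
   Put_Line ("Total =" & Natural'Image (Total));
end Rendezvous_Demo;
```

No explicit semaphore appears anywhere: callers that arrive while Counter is busy simply wait in the queue for the entry they called, so the four clients' 400 increments are all counted.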

Shared memory. Another issue facing 
process-model languages is the dichotomy 
between shared-memory systems and 
message-based systems. If the language as¬ 
sumes that the parallel processors share 
memory, then it is convenient and efficient 
to use that shared memory for interprocess 
synchronization and to allow multiple 
processes access to shared variables. How¬ 
ever, that approach virtually precludes ex¬ 
tending the parallel processing scheme to 
loosely coupled networks, which lack 
shared memory. Message-based systems, 
on the other hand, incur a substantial runtime 
performance penalty, but are more 
general in that they can be used in a wider 
range of configurations. 

Ada rendezvous provides a diplomatic 
compromise between these two ap¬ 
proaches. Because the synchronization 
mechanism is implicit, it can be imple¬ 
mented efficiently both within shared 
memory systems and across loosely cou¬ 
pled networks. Multiple tasks can access 
shared variables, but in a restricted way 
that allows for loosely coupled systems. 
Updating is guaranteed only at synchro¬ 
nization points, so that the shared vari¬ 
ables can in fact reside in separate memo¬ 
ries, with copies passed back and forth 
only during rendezvous. 2 

Other characteristics of 
Ada tasking 

In this section we examine how tasks fit 
into the language framework provided by 
Ada. From this point of view an Ada task 
is similar to an Ada package, except that 
the objects allowed in its visible part are 
limited to entries (the services provided 
and requested in a rendezvous). The body 
of a package is an initialization routine 
that executes in sequence with the program 
that imports it, whereas the body of a task 
is a routine that has its own control path 
and executes in parallel with the program 
that declares it. Tasks are subject to the 
same visibility and scoping rules as the 
other features of the language. However, 
tasks are asymmetric with respect to the 
way they refer to each other during rendez¬ 
vous. The task making the entry call must 
name the task with which it seeks to ren¬ 
dezvous, but the accepting task need not. 
As we shall see later, this asymmetry is in 
fact unnecessary and an unfortunate con¬ 
sequence of the queue-per-entry imple¬ 
mentation model. 

Task activation. The requirements for 
reliability and ease of programming 
argued against explicit task activation in 
Ada, since implicit activation avoids many 
programming errors caused by incorrect 
activation sequences, such as the acciden¬ 
tal reactivation of a task. An Ada task is 
allocated at the point of its declaration, 
but its execution does not start until its 
parent reaches the begin in its body part. 
The language is sometimes criticized for 
this, on the grounds that requiring explicit 
task activation somehow gives the pro¬ 
grammer greater control over the activa¬ 
tion sequence. In actuality, task types al¬ 
low the activation of tasks at any point in 
the scope of the task type. A further level 
of control over the activation sequence can 
be had at the cost of an additional initial¬ 
ization entry used to suspend the spawned 
task at the point it begins executing. 
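
A hedged sketch of both techniques follows. It assumes a task type created with an allocator, so that activation happens at a point the program chooses, and an extra initialization entry that holds the new task until its creator releases it. The names Activation_Demo, Worker, Worker_Ref, and Start are hypothetical and do not come from the article.

```ada
with Text_IO; use Text_IO;

procedure Activation_Demo is

   task type Worker is
      entry Start (Id : in Integer);   --  hypothetical initialization entry
   end Worker;

   type Worker_Ref is access Worker;

   W : Worker_Ref;

   task body Worker is
      My_Id : Integer;
   begin
      --  The new task suspends here until its creator calls Start.
      accept Start (Id : in Integer) do
         My_Id := Id;
      end Start;
      Put_Line ("worker" & Integer'Image (My_Id) & " started");
   end Worker;

begin
   Put_Line ("main: creating worker");
   W := new Worker;   --  the task is created and activated at this point
   W.Start (7);       --  release it from its initial accept
end Activation_Demo;
```

The allocator determines when the task comes into existence, and the Start rendezvous determines when it begins useful work.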

Error handling. Robust error handling 
is crucial in the real-time environments for 
which Ada was designed. Ada’s excep¬ 
tions provide a general mechanism for 
handling run-time errors that is well suited 
to coping with intratask error propaga¬ 
tion, but has its drawbacks when com¬ 
bined with tasking. For example, Ada’s 
rules say that exceptions raised while a 
task’s variables are being initialized cause 
that task to die and an exception to be 
raised in its parent, so that if a task wants 
to handle such exceptions itself, it must 
resort to such tricks as declaring all its 
variables inside a declare statement. 
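
The workaround can be sketched as follows, under the assumption that the failing initialization is simulated by a hypothetical function Initial_Value; the names Guarded_Init and T are likewise illustrative. Because the variable is declared in an inner block rather than in the task body's own declarative part, the exception raised during its initialization reaches the task's own handler.

```ada
with Text_IO; use Text_IO;

procedure Guarded_Init is

   task T;

   --  Hypothetical initializer that can fail.
   function Initial_Value (Fail : Boolean) return Positive is
   begin
      if Fail then
         raise Constraint_Error;
      end if;
      return 1;
   end Initial_Value;

   task body T is
   begin
      declare
         --  Declared in an inner block, not in the task's declarative part.
         N : Positive := Initial_Value (Fail => True);
      begin
         Put_Line ("N =" & Integer'Image (N));
      end;
   exception
      when Constraint_Error =>
         Put_Line ("task handled its own initialization error");
   end T;

begin
   null;  --  the main program simply waits for T to finish
end Guarded_Init;
```

Had N been declared directly in the declarative part of T, the task would have failed during activation and the error would have surfaced in the parent instead.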

Ada provides for another form of asyn¬ 
chronous task communication besides ex¬ 
ceptions, namely the abort statement, 
which is in effect an exception for which 
no exception handler is allowed. This 
greatly limits its utility. Moreover, the im¬ 
plicit abort raised in the aborted task’s 
children makes abort extremely expensive 
to implement. 

Access to the environment. The final 
language issue we consider is the way in 
which parallel processing features are 
mapped onto the underlying real-time en¬ 
vironment. An approach commonly used 
in high-level languages not originally 
designed to support parallel processing is 
to supplement the language with prede¬ 
fined calls on operating system intrinsics. 
The drafters of the Ada requirements 
realized that while such an approach of¬ 
fers a certain degree of machine indepen¬ 
dence, it will not permit program porta¬ 
bility, since the resulting programs are 
dependent on the underlying operating 
system. They thus opted for an approach 
where all such operating system calls are 
implicit. Tasking in Ada is meant to be im¬ 
plemented in a runtime package that man¬ 
ages tasks by calling concurrency primi¬ 
tives in the underlying hardware or 
operating system. To move Ada from one 
environment to another, it is necessary to 
rewrite only the code generator and run¬ 
time package. 

True to the goal of providing access to 
the underlying hardware in as machine- 
independent a way as possible, Ada allows 
machine interrupts to be attached to 
parameterless entries by using representa¬ 
tion specifications, so that they can be 
handled without resorting to machine lan¬ 
guage. It is true that library units declaring 
such tasks, like most programs using rep¬ 
resentation specifications, are not por¬ 
table. However, such machine-dependent 
code can be encapsulated in low-level 
packages and not scattered throughout the 
programs that use them. A similar philos¬ 
ophy is evident in the handling of the real¬ 
time clock: the package calendar and the 
delay statement provide a uniform means 
of accessing the underlying hardware. 
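
As a small illustration (a sketch, not the article's code), the fragment below uses only the predefined package Calendar and the delay statement to run a periodic activity without any operating-system call appearing in the source. The names Delay_Demo, Period, and Next_Time and the 50-millisecond period are assumptions chosen for the example.

```ada
with Calendar; use Calendar;
with Text_IO;  use Text_IO;

procedure Delay_Demo is
   Period    : constant Duration := 0.05;   --  assumed 50 ms period
   Next_Time : Time := Clock;
begin
   for Tick in 1 .. 3 loop
      Next_Time := Next_Time + Period;
      --  Sleep until the next scheduled instant; the delay statement
      --  guarantees only a minimum wait, so some jitter is possible.
      delay Next_Time - Clock;
      Put_Line ("tick" & Integer'Image (Tick));
   end loop;
end Delay_Demo;
```

Computing each wake-up time from the previous one, rather than delaying by a fixed amount, keeps the loop from drifting as processing time accumulates.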

-- Generalized (hypothetical) form of tasking: not legal Ada (see text). 
package print_spooler is 
   entry print(to_be_printed: file_name); 
end print_spooler; 

package body print_spooler is 

   task line_printer; 
   task laser_printer; 

   task body line_printer is 
   begin 
      loop 
         accept print(to_be_printed: file_name) do 
         end print; 
         -- do printing here 
      end loop; 
   end line_printer; 

   task body laser_printer is 
   begin 
      loop 
         accept print(to_be_printed: file_name) do 
         end print; 
         -- do printing here 
      end loop; 
   end laser_printer; 

end print_spooler; 

Figure 2. Print spooler with task 
generalization. 

We feel that all these features make Ada 
an ideal candidate for programming 
parallel applications on multicomputers 
such as the Encore Multimax, Sequent 
Balance, ELXSI, or Alliant. The parallelism 
of applications such as systems simulation, 
parallel AI programs that search 
large databases, and so on is for the most 
part large-grained parallelism that maps 
smoothly onto Ada tasking. Programming 
such applications in Ada is largely a 
matter of isolating the computations that 
can be carried on in parallel, defining the 
points at which those computations must 
synchronize with each other, and expressing 
the parallel computations and synchronization 
points as tasks and rendezvous, 
respectively. One advantage Ada 
possesses is that, as we have seen, it ex¬ 
ploits shared memory when the underlying 
hardware provides it, but does not require 
that it be provided. Whether or not Ada is 
appropriate for implementing applica¬ 
tions with finer-grained parallelism, such 
as signal processing or matrix multiplica¬ 
tion, depends largely on the implementa¬ 
tion of the language. To be sure, such ap¬ 
plications can be expressed in Ada, but 
unless the compiler optimizes certain 
special cases, the overhead associated with 
tasking in all its generality may outweigh 
the benefits of parallelism when the ap¬ 
plication involves a large number of small 
tasks. Compiler support is also required 
for introducing parallelism on the expres¬ 
sion level. 
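
The recipe for large-grained parallelism described above can be sketched in a few lines. This is an illustrative example under stated assumptions, not a program from the article: the names Parallel_Sum, Vector, Data, Summer, Start, and Result are hypothetical. The two partial sums are the computations carried on in parallel, and the Start and Result entries are the synchronization points.

```ada
with Text_IO; use Text_IO;

procedure Parallel_Sum is

   type Vector is array (Positive range <>) of Integer;
   Data : constant Vector (1 .. 8) := (1, 2, 3, 4, 5, 6, 7, 8);

   task type Summer is
      entry Start (First, Last : in Positive);   --  hand over a slice
      entry Result (Sum : out Integer);          --  collect the partial sum
   end Summer;

   Left, Right : Summer;
   A, B        : Integer;

   task body Summer is
      Lo, Hi : Positive;
      Total  : Integer := 0;
   begin
      accept Start (First, Last : in Positive) do
         Lo := First;
         Hi := Last;
      end Start;
      --  Independent computation: runs concurrently with the other Summer.
      for I in Lo .. Hi loop
         Total := Total + Data (I);
      end loop;
      accept Result (Sum : out Integer) do
         Sum := Total;
      end Result;
   end Summer;

begin
   Left.Start (1, 4);
   Right.Start (5, 8);
   Left.Result (A);
   Right.Result (B);
   Put_Line ("Sum =" & Integer'Image (A + B));
end Parallel_Sum;
```

With only two tasks and eight elements the tasking overhead certainly outweighs any gain, which is the point made above about fine-grained work. The structure pays off only when each slice represents a substantial computation.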

Some limitations of Ada 
tasking 

We have argued that Ada’s parallel pro¬ 
cessing facilities represent a unified, con¬ 
sistent response to the problems of inte¬ 
grating parallel processing features into a 
high-level language. Ada does impose a 
number of restrictions on tasking, however, 
and we conclude by examining three 
of them. First, calling tasks must name the 
task whose entries they are calling, al¬ 
though the called task need not name the 
caller. Second, entries must be declared 
within the task that accepts them. Finally, 
Ada’s select statement, which provides 
multiple simultaneous rendezvous re¬ 
quests, does not allow mixed entry calls 
and accepts, nor multiple calls. Given the 
queue-per-entry implementation model 
we discussed above, these restrictions are 
necessary for efficient implementation. 

To see the impact that Ada’s restrictions 
have on tasking programs, consider the 
following example. Printers of several dif¬ 
ferent kinds are available, each with its 
own driver task. We do not want to force 
the users of the printers to know about all 
the different kinds of printers available; 
instead, we want to define a single printing 
service that all the print drivers will pro¬ 
vide. To that end, we wish to write a print 
spooler that will accept a file name from a 
customer task. To keep the example sim¬ 
ple, we assume that the spooler has no 
queue of file names, but rather delays the 
customer task until a printer is available. 
The customer will then be allowed to con¬ 
tinue while the printer prints out the file. 

An obvious way to write such a program 
is to define an entry “print(my_file: 
file_name)” that can be called by any user 
task and accepted by any of the various 
printer drivers. Figure 2 illustrates such an 
approach. Unfortunately, it is not possible 
to write the spooler this way in Ada. Since 
entries can be declared only in the visible 
part of the task that accepts them, only 
one task may accept a given entry. As 
shown in Figure 3, the usual solution to 
this problem is to introduce a turnaround 
task whose only function is to act as an in¬ 
termediary between the customer and 
server tasks. The “print” entry is used to 
synchronize the customer task with the 
spooler, so an additional entry must be de¬ 
clared to synchronize the spooler with the 
server tasks. The spooler task first waits 
for a rendezvous with a customer request¬ 
ing a print service. The customer is then 
held up while the spooler does a second ac¬ 
cept, this time to rendezvous with a print 
driver. When that rendezvous occurs, the 
file name is passed from the customer to 
the printer driver and both rendezvous are 
terminated. 

task print_spooler is
   entry print(to_be_printed: file_name);
   entry get_name(file_to_print: in out file_name);
end print_spooler;

task body print_spooler is

   task line_printer;   -- one task per physical printer
   task laser_printer;

   task body line_printer is
      file_to_print: file_name;
   begin
      loop
         print_spooler.get_name(file_to_print);
         -- do the printing here
      end loop;
   end line_printer;

   task body laser_printer is
      file_to_print: file_name;
   begin
      loop
         print_spooler.get_name(file_to_print);
         -- do the printing here
      end loop;
   end laser_printer;

begin
   loop
      select
         accept print(to_be_printed: file_name) do
            accept get_name(file_to_print: in out file_name) do
               file_to_print := to_be_printed;
            end get_name;
         end print;
      end select;
   end loop;
end print_spooler;

Figure 3. Print spooler in Ada.

We note in passing that the generalized
form of tasking would make it simple to
change the program so that the calling task
is blocked until the named file has actually 
been printed. Since in the generalized form 
the rendezvous is between the printer and 
the customer, this has no effect on other 
printers and customers. In Ada, however, 
the rendezvous is between the customer 
and the turnaround task. Blocking the 
customer would also block the spooler, 
and all other customers and servers as well. 
We leave it to the reader to sketch out the 
whole new layer of synchronization mech¬ 
anisms (perhaps an array of Boolean 
values for the printers that the spooler 
could use as guards, plus an extra “print¬ 
ing done” entry) that would be needed in 
Ada for this simple example. 

A further drawback of the queue-per-entry
rendezvous model is that it does not
guarantee fairness. That is, it is possible
for a task to be permanently blocked
waiting for a rendezvous. Consider an
Ada task T with a select statement that
accepts the two entries E1 and E2. Suppose
that this task is implemented as described
above, with separate queues for E1 and
E2. Suppose further that there are enough
calls on E1 to keep T permanently busy.
Any caller on E2 will be permanently
blocked, since T will continually rendezvous
with the callers in the queue for E1.
The only way to avoid this unfairness
would be to have T scan all of its queues
before entering rendezvous, looking for
the oldest rendezvous request. Unfortunately,
this would be a very expensive
proposition.

Why did Ada impose the restriction that
entries belong to tasks in the first place?
One possible answer is that it reconciles
multiway waiting with the queue-per-entry
model. Consider the general problem of
allowing a task T to attempt several
rendezvous at the same time. If the
implementation model uses one queue per
entry, allowing multiple simultaneous
rendezvous requests would mean that T
would have to be placed on several different
queues at once, one for each of the
entries (see Figure 4). This is very expensive
to implement. Not only is queue management
made much more difficult by the
necessity of scanning multiple queues, but
the semantics of synchronization are hard
to ensure. When T begins a rendezvous at
one entry, it must be removed from all the
other queues it was waiting in before any
other task takes it out to begin a rendezvous
at one of the corresponding entries.
This timing problem, or “race condition,”
is what makes the rule that a task can be
in only a single queue at a time so important.

Figure 4. The multiway dilemma. How to avoid placing blocked tasks
in more than one queue? Ada’s answer is to disallow multiway calling
and to restrict multiway accepting to the entries declared by a single
task.

There are two distinct ways to avoid
putting a task in multiple queues while
preserving the queue-per-entry model. The
first is simply to disallow multiway
rendezvous requests. When using this solution,
tasks can attempt rendezvous with only
one other task at a time, so only one queue
is involved. The second solution is to
impose a rule that says that multiway
rendezvous requests are permitted only when
the multiple entries all belong to the same
task. This allows elimination of the queue
altogether, since the task containing the
entries can simply be marked as either
“waiting” or “busy.” For example, a task
accepting five different entries does not
need to be placed in five different queues
when none of the entries have callers.
Instead, the accepting task is marked as
“waiting” until one of the five entries is
called.

Figure 5. A generalized rendezvous mechanism. Entry queues are disassociated from
tasks so that a single service can be provided by many different tasks.

Now, multiway calling is of little value if 
all of the calls must be on entries in a single 
task. Multiway accepting, however, is cru¬ 
cial, even if all the entries being accepted 
must be in one task. So Ada makes the log¬ 
ical choice: it uses one solution on the call 
side, the other on the accept side. It disal¬ 
lows multiway calling, but allows multi¬ 
way accepting as long as the entries are all 
in the same task. 

Alternative implementations, however, 
allow unrestricted multiway rendezvous 
requests while guaranteeing fairness. Per¬ 
haps the simplest implementation, shown 
in Figure 5, uses a single system-wide 
queue for all blocked tasks. Although the 
number of tasks in a system is typically 
quite small, the length of the queue in the 
worst case could cause inefficient task 
switching. Fortunately, there is an optimal 
method of splitting the queue that avoids 
this problem. 2 
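
The fairness argument can be illustrated with a small simulation, sketched here in Python. It assumes a server that always inspects the E1 queue before the E2 queue (one possible reading of the queue-per-entry model, not a description of any particular Ada runtime) and compares it with a single queue ordered by arrival. The function names and the workload are hypothetical.

```python
from collections import deque

def serve_per_entry(arrivals, steps):
    """One queue per entry; the server checks E1 first on every step (assumed policy)."""
    queues = {"E1": deque(), "E2": deque()}
    served = []
    for now in range(steps):
        for entry in arrivals(now):
            queues[entry].append((now, entry))
        for name in ("E1", "E2"):            # fixed scan order
            if queues[name]:
                served.append(queues[name].popleft())
                break                        # one rendezvous per step
    return served

def serve_single_queue(arrivals, steps):
    """One system-wide queue; the oldest request is always served first."""
    queue = deque()
    served = []
    for now in range(steps):
        for entry in arrivals(now):
            queue.append((now, entry))
        if queue:
            served.append(queue.popleft())
    return served

def busy_e1(now):
    """Hypothetical workload: one E1 caller per step, one E2 caller at step 0."""
    return ["E1", "E2"] if now == 0 else ["E1"]

per_entry = serve_per_entry(busy_e1, 50)
fifo = serve_single_queue(busy_e1, 50)
print("E2 served with per-entry queues:", any(e == "E2" for _, e in per_entry))
print("E2 served with a single queue:  ", any(e == "E2" for _, e in fifo))
```

Under this workload the E2 caller is never reached in the first case and is served on the second step in the other. The cost of a long single queue, which the text raises, is not modeled.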

Ada provides high-level mechanisms
for task creation and synchronization
that are machine independent and
operating system independent. They can be
efficiently implemented on distributed
systems as well as in shared memory, and
on multiprocessors as well as uniprocessors.

Hindsight suggests that a more general 
synchronization mechanism would have 
simplified the programming of many ap¬ 
plications. It is well to bear in mind, 
however, that these inconveniences arise 
in applications that previously could not 
be described at all in a high-performance 
HOL without using machine language in¬ 
sertions. The theoretical advances that 
make the more general mechanisms ap¬ 
pealing have been made in the decade since 
the requirements for Ada were published, 
and were to a large degree stimulated by 
Ada’s innovative approach to parallel 
processing. □

Domesticating Parallelism


Linda and Friends 



Linda consists of a 
few simple primitives 
that support an 
"uncoupled" style of 
parallel programming. 
Implementations exist 
on a broad spectrum 
of parallel machines. 


Linda consists of a few simple operators
designed to support and simplify
the construction of explicitly-parallel 
programs. Linda has been implemented 
on AT&T Bell Labs’ S/Net multicom¬ 
puter and, in a preliminary way, on an 
Ethernet-based MicroVAX network and 
an Intel iPSC hypercube. Although the 
implementations are new and need refine¬ 
ment, our early experiences with them 
have been revealing, and we take them as 
supporting our claim that Linda is a pro¬ 
mising tool. 

Parallel programming is often described 
as being fundamentally harder than con¬ 
ventional, sequential programming, but in 
our experience (limited so far, but grow¬ 
ing) it isn’t. Parallel programming in Lin¬ 
da is conceptually the same order of task 
as conventional programming in a sequen¬ 
tial language. Parallelism does, though, 
encompass a potentially difficult problem. 
A conventional program consists of one 
executing process, of a single point in 
computational time-space, but a parallel 
program consists of many, and to the ex¬ 
tent that we have to worry about the rela¬ 
tionship among these points in time and 
space, the mood turns nasty. Linda’s mis¬ 
sion, however, is to make it largely 
unnecessary to think about the coupling 
between parallel processes. Linda’s un¬ 
coupled processes, in fact, never deal with 
each other directly. A parallel program in 
Linda is a spatially and temporally un¬ 
ordered bag of processes, not a process
graph. To the extent that process uncoupling
succeeds, the difficulty of designing,
debugging, and understanding a parallel 
program grows additively and not multi- 
plicatively with the variety of processes it 
encompasses. 

When the simple operators Linda provides
are injected into a host language h,
they turn h into a parallel programming 
language. A Linda system consists of the 
runtime kernel, which implements inter¬ 
process communication and process man¬ 
agement, and a preprocessor or compiler. 
A Linda-based parallel language is in fact 
a new language, not an old one with added 
system calls, to the extent that the prepro¬ 
cessor or compiler recognizes the Linda 
operations, checks and rewrites them on 
the basis of symbol table information, and 
can optimize the pattern of kernel calls 
that result based on its knowledge of con¬ 
stants and loops, among other things. 
Most of our programming experiments 
so far have been conducted in C-Linda 
(and we use C-Linda for the examples 
below), but we have implemented a 
Fortran-Linda preprocessor as well. The 
kernel is language-independent. It will 
support N-Linda for any language N.

Associated with the Linda operators is a 
particular programming methodology, 
based on distributed data structures. (The 
language doesn’t restrict programmers to 
this methodology. It merely allows the 
methodology, which most other languages 
don’t.) The distributed-data-structure
methodology in turn suggests a particular
strategy for dealing with parallelism. Most 
models of parallelism assume that a pro¬ 
gram will be parallelized by partitioning it 
into a large number of simultaneous activ¬ 
ities. This partitioning appears, however, 
to be relatively difficult to do, especially 
when we consider large multicomputers 
that support thousand-fold parallelism 
and beyond. In the Linda framework, we 
can get parallelism by replicating as well as 
by partitioning. We anticipate that it will 
frequently be simpler to stamp out many 
identical copies of one process than to 
create the same number of distinct pro¬ 
cesses. So the final ingredient in the Linda 
framework is a strategy for coping with 
parallelism by replication rather than by 
partitioning. 

In the following we discuss first what it 
seems to us that parallel programmers 
need. We then describe Linda, some pro¬ 
gramming experiments using Linda, the 
current implementation, and the project 
now underway to go beyond the current 
software implementation and build a 
hardware Linda machine. We go on to dis¬ 
cuss some related higher-level parallel lan¬ 
guages that can be implemented on top of 
the Linda kernel, particularly the sym¬ 
metric languages. We close in a blaze of 
speculation. 

What parallel 
programmers need 

Many parallel algorithms are known 
and more are in development; many paral¬ 
lel machines are available and many more 
will be soon. But the fate of the whole ef¬ 
fort will ultimately be decided by the ex¬ 
tent to which working programmers can 
put the algorithms and the machines to¬ 
gether. The needs of parallel programmers 
have not been accommodated very well to 
date. 

A machine-independent and (poten¬ 
tially) portable programming vehicle. De¬ 
signers of parallel languages generally hold 
that programming tools should accommo¬ 
date a high-level programming model, not 
a particular architecture. But as parallel 
machines emerge commercially, there has 
been little effort spent on making high- 
level, machine-independent tools avail¬ 
able on them. Young debutante machines 
are sometimes gotten-up in their own full¬ 


blown parallel languages; more often they 
come dressed in only a handful of idiosyn¬ 
cratic system calls that support the local 
variant of message-passing or memory¬ 
sharing. In either case, so long as each new 
machine is provided with its own parallel 
programming system, programs for multi¬ 
computer x will not only have to be re¬ 
coded, they may need to be conceptually 
reformulated to run on multicomputer y. 
(This is particularly true if x is a shared- 
memory machine like a BBN Butterfly 1 or 
an IBM RP3 2 and y is a network, like an 
Intel iPSC.) But users need to be able to 
run parallel programs on a range of archi¬ 
tectures, particularly now, when interest¬ 
ing designs of unknown merit proliferate. 
They need to be able to communicate par¬ 
allel algorithms. Methodological knowl¬ 
edge can’t grow when sources are cluttered 
with local dialect. Finally, they need pro¬ 
gramming tools suited to their needs, not 
to the machine’s. 

A programming tool that absolves them 
as fully as possible from dealing with 
spatial and temporal relationships among 
parallel processes. We referred to the 
general problem of uncoupling above. 
Uncoupling has both a spatial and a tem¬ 
poral aspect. Spatially, each process in a 
parallel program will usually develop a re¬ 
sult or a series of results that will be ac¬ 
cepted as input by certain other processes. 
Uncoupling suggests that process q should 
not be required to know or care that pro¬ 
cess j accepts q’s data as input. Instead of 
requiring q to execute an explicit “send 
this data to j” statement, we would rather 
that q be permitted simply to tag its new 
data with a logical label (for example, 
“new data from q”) and then forget about 
it, under the assumption that any process 
that wants it will come and get it. At some 
later point in program development, a dif¬ 
ferent process may decide to deal with q’s 
data. Under the spatially-uncoupled 
scheme, this won’t matter to q. 

Temporal uncoupling involves similar 
though perhaps slightly more subtle issues. 
If q is forced to send to j explicitly, the
system is constrained to have both processes
loaded simultaneously (or at least to have
buffer space allocated for j when q runs).
Further, most parallel languages attach 
some form of synchronization constraint 
to send. A synchronized send operation 
like Ada’s entry call or CSP-Occam’s out¬ 


put statement forces the system not merely 
to load but to run the receiving process be¬ 
fore the sender can continue. We would 
rather that our parallel programs be large¬ 
ly free of scheduling implications like 
these. Not only do they constrain the sys¬ 
tem in ways that may be undesirable, but 
they force programmers to think in simul¬ 
taneities. As far as possible, we would like 
programmers to be able to develop q’s 
code without having to envision other si¬ 
multaneous execution loci. To achieve 
this, we would like q to be allowed to take 
each new datum it develops and heave it 
overboard without a backwards glance. 
(We make this a bit more concrete below.) 

A programming tool that allows tasks 
to be dynamically distributed at runtime. 

Generally there is more logical parallelism 
in a parallel algorithm than physical paral¬ 
lelism in a host multicomputer, which 
means that at runtime there are more 
ready tasks than idle processors. Good 
speedup obviously requires that tasks be 
evenly distributed among available pro¬ 
cessors. Many systems require that this 
distribution be performed statically at 
load time. Sometimes, finding a good stat¬ 
ic distribution is easy, notably when the 
program’s logical structure matches the 
machine’s physical structure. As the pro¬ 
gram’s logical structure grows more 
irregular, the task gets harder, and when 
the program’s computational focus devel¬ 
ops dynamically at runtime, finding a 
good static mapping may be impossible. 
Many important applications and pro¬ 
gram structures fall into the first, easily- 
handled category, but many more do not. 
For those that don’t, dynamic distribution 
of tasks is essential. 
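
The difference between static and dynamic distribution can be made concrete with a back-of-the-envelope sketch in Python. It compares the finishing time when tasks are split into fixed contiguous blocks at load time with the finishing time when each processor simply takes the next task as soon as it is free. The task durations and function names are hypothetical, and communication cost is ignored.

```python
import heapq

def static_makespan(durations, processors):
    """Tasks are split into equal-sized contiguous blocks before the run starts."""
    block = -(-len(durations) // processors)     # ceiling division
    loads = [0] * processors
    for k, d in enumerate(durations):
        loads[k // block] += d
    return max(loads)

def dynamic_makespan(durations, processors):
    """Each task goes to whichever processor becomes free first."""
    free_at = [0] * processors                   # a list of zeros is a valid heap
    for d in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + d)
    return max(free_at)

# Irregular workload: four long tasks followed by four short ones.
durations = [8, 8, 8, 8, 1, 1, 1, 1]
print(static_makespan(durations, 2))
print(dynamic_makespan(durations, 2))
```

With two processors the static split leaves one processor with all the long tasks, while the dynamic scheme reaches the ideal of half the total work per processor for this workload.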

A programming tool that can be imple¬ 
mented efficiently on existing hardware. 

Obviously. Parallel language research has 
produced far more designs than implemen¬ 
tations. Elegant language ideas will always 
be interesting regardless of the existence of 
good implementations, but parallel pro¬ 
grammers, as opposed to language re¬ 
searchers, require implementable elegance. 

Linda 

Linda centers on an idiosyncratic memory
model. Where a conventional memory’s
storage unit is the physical byte (or
something comparable), Linda memory’s
storage unit is the logical tuple, or ordered 
set of values. Where the elements of a con¬ 
ventional memory are accessed by ad¬ 
dress, elements in Linda memory have no 
addresses. They are accessed by logical 
name, where a tuple’s name is any selection
of its values. Where a conventional
memory is accessed via two operations, 
read and write, a Linda memory is ac¬ 
cessed via three: read, add, and remove. 

It is a consequence of the last character¬ 
istic that tuples in a Linda memory can’t 
be altered in situ. To be changed, they 
must be physically removed, updated, and 
then reinserted. This makes it possible for 
many processes to share access to a Linda 
memory simultaneously; using Linda we 
can build distributed data structures that, 
unlike conventional ones, may be manipu¬ 
lated by many processes in parallel. 
Furthermore, as a consequence of the first 
characteristic (a Linda memory stores 
tuples, not bytes), Linda’s shared memory 
is coarse-grained enough to be supported 
efficiently without shared-memory hard¬ 
ware. Shared memory has long been 
regarded as the most flexible and powerful 
way of sharing data among parallel pro¬ 
cesses, but a naive shared memory requires 
hardware support that is complicated, ex¬ 
pensive to build, and suitable only for 
multicomputers, not for local area net¬ 
works. Linda’s variant of shared memory, 
on the other hand, runs both on the S/Net 
and on a MicroVAX network, neither of 
which provides any physically shared 
memory. (Of course, Linda may be im¬ 


plemented on shared-memory multicom¬ 
puters as well, as we discuss below.) 

Linda’s shared memory is referred to as 
tuple space, or TS. Messages in Linda are 
never exchanged between two processes 
directly. Instead, a process with data to 
communicate adds it to tuple space and a 
process that needs to receive data seeks it, 
likewise, in tuple space. There are four
operations defined over TS: out(), in(),
read(), and eval(). out(t) causes tuple t to
be added to TS; the executing process
continues immediately. in(s) causes some
tuple t that matches template s to be
withdrawn from TS; the values of the
actuals in t are assigned to the formals in s
and the executing process continues. If no
matching t is available when in(s) executes,
the executing process suspends until one
is, then proceeds as before. If many
matching t’s are available, one is chosen
arbitrarily. read(s) is the same as in(s), with
actuals assigned to formals as before,
except that the matched tuple remains in TS.
For example, executing 
out(“P”, 5, false) 

causes the tuple (“P”, 5, false) to be added 
to TS. The first component of a tuple 
serves as a logical name, here “P”; the re¬ 
maining components are data values. Sub¬ 
sequent execution of 
in(“P”, int i, bool b) 

might cause tuple (“P”, 5, false) to be 
withdrawn from TS. 5 would be assigned 
to i and false to b. Alternatively, it might 
cause any other matching tuple (any other, 
that is, whose first component is “P” and 


whose second and third components are 
an integer and a Boolean, respectively) to 
be withdrawn and assigned. Executing 
read(“P”, int i, bool b) 
when (“P”, 5, false) is available in TS may
cause 5 to be assigned to i and false to b, or
equivalently may cause the assignment of
values from some other type-consonant
tuple, with the matched tuple itself
remaining in TS in either case. eval(t) is the
same as out(t), except that eval adds an
unevaluated tuple to TS. (eval is not primitive
in Linda; it will be implemented on top of
out. We haven’t done this yet in S/Net-
Linda, so we omit further mention of
eval.) See Figures 1, 2, and 3.
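
The behavior of out(), in(), and read() can be mimicked in a few lines of Python. The sketch below is a toy, single-process stand-in for tuple space and not the S/Net kernel. Several things in it are assumptions of the sketch: the class and method names, writing a formal as a Python type (int for "int i"), returning the matched tuple instead of assigning to the formals, and always picking the first match where Linda may pick any.

```python
import threading

class TupleSpace:
    """Toy tuple space. A template field that is a Python type is a formal;
    any other template field is an actual."""

    def __init__(self):
        self._tuples = []
        self._changed = threading.Condition()

    def out(self, *fields):
        """Add a tuple; the caller continues immediately."""
        with self._changed:
            self._tuples.append(fields)
            self._changed.notify_all()

    @staticmethod
    def _matches(template, candidate):
        if len(template) != len(candidate):
            return False
        for want, have in zip(template, candidate):
            if isinstance(want, type):                 # formal: types must agree
                if type(have) is not want:
                    return False
            elif type(want) is not type(have) or want != have:
                return False                           # actual: values must agree
        return True

    def _wait_for(self, template, remove):
        with self._changed:
            while True:
                for k, candidate in enumerate(self._tuples):
                    if self._matches(template, candidate):
                        if remove:
                            del self._tuples[k]
                        return candidate
                self._changed.wait()                   # suspend until an out() runs

    def in_(self, *template):                          # "in" is a Python keyword
        """Withdraw a matching tuple, suspending until one is available."""
        return self._wait_for(template, remove=True)

    def read(self, *template):
        """Like in_(), but the matched tuple remains in the space."""
        return self._wait_for(template, remove=False)


ts = TupleSpace()
ts.out("P", 5, False)
_, i, b = ts.read("P", int, bool)    # the tuple stays in TS
_, i, b = ts.in_("P", int, bool)     # the tuple is withdrawn
print(i, b)
```

A second in_("P", int, bool) at this point would suspend the caller until some other thread executed a matching out().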

The parameters to an in() or read() 
statement needn’t all be formals. Any or 
all may be actuals as well. All actuals must 
be matched by corresponding actuals in a 
tuple for tuple-matching to occur. Thus 
the statement 

in(“P”, int i, 15)

may withdraw tuple (“P”, 6, 15) but not 
tuple (“P”, 6, 12). When a variable ap¬ 
pears in a tuple without a type declarator, 
its value is used as an actual. The annota¬ 
tion formal may precede an already- 
declared variable to indicate that the pro¬ 
grammer intends a formal parameter. 
Thus, if i and j have already been declared 
as integer variables, the following two 
statements are equivalent to the preceding 
one: 

j = 15; in(“P”, formal i, j) 

This extended naming convention (it resembles
the select operation in relational
databases) is referred to as structured
naming. Structured naming makes TS 
content-addressable, in the sense that pro¬ 
cesses may select among a collection of 
tuples that share the same first component 
on the basis of the values of any other com¬ 
ponent fields. Any parameter to out() or 
eval() except the first may likewise be a for¬ 
mal; a formal parameter in a tuple matches 
any type-consonant actual in an in or read 
statement’s template. See Figure 4. 
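
The matching rule behind structured naming can be written down as a small predicate. The Python sketch below is illustrative only: formals are represented as Python types, the function names are hypothetical, and treating a formal set against another formal as a non-match is an assumption of the sketch, since the text does not say what should happen in that case.

```python
def field_matches(template_field, tuple_field):
    """Compare one template field with one tuple field."""
    template_formal = isinstance(template_field, type)
    tuple_formal = isinstance(tuple_field, type)
    if template_formal and tuple_formal:
        return False                                  # assumption: formals never match each other
    if template_formal:                               # formal in the in()/read() template
        return isinstance(tuple_field, template_field)
    if tuple_formal:                                  # formal stored by out()
        return isinstance(template_field, tuple_field)
    return template_field == tuple_field              # actual against actual

def matches(template, candidate):
    return len(template) == len(candidate) and all(
        field_matches(t, c) for t, c in zip(template, candidate))

def select(template, tuples):
    """All tuples a template could match, in the manner of a relational select."""
    return [c for c in tuples if matches(template, c)]

space = [("P", 6, 15), ("P", 6, 12), ("P", 9, 15), ("Q", 6, 15)]
j = 15
hits = select(("P", int, j), space)    # plays the role of in("P", formal i, j)
print(hits)
```

Only the tuples named "P" whose third field equals 15 are candidates; which one an actual in() would withdraw is arbitrary.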


Programming in Linda 



Linda accommodates the needs for un¬ 
coupling and dynamic scheduling we listed 
above by relying on distributed data struc¬ 
tures. As noted, a distributed data struc¬ 
ture is one that may be manipulated by 
many parallel processes simultaneously. 
Distributed data structures are the natural 
complement to parallel program struc¬ 
tures, but despite this natural relationship, 
distributed data structures are impossible 
in most parallel programming languages. 
Most parallel languages are based instead 
on what we call the manager process 
model of parallelism, which requires that 
shared data objects be encapsulated within 
manager processes. Operations on shared 
data are carried out, on request, by the 
manager process on the user’s behalf. See 
Figures 5 and 6. 

The manager-process model has impor¬ 
tant advantages, and manager-process 
programs are easy to write in Linda. What
is significant, though, is the number of 
cases in which distributed data structure 
programs come closer to achieving the 
qualities we want. They do so particularly 
in the context of parallel programs struc¬ 
tured not as logical networks but as col¬ 
lections of identical workers. In logical- 
network-style parallelism (the more 
common variety), a program is partitioned 
into n pieces, where n is determined by the 
logic of the algorithm. Each of the n 
logical pieces is implemented by a process, 
and each process keeps its attention 
demurely fixed on its own conventional, 
local data structures. In the replicated 
worker model, we don’t partition our pro¬ 
gram at all; we replicate it r times, where r 
is determined by the number of processors 
we have available. All r processes scramble 
simultaneously over a distributed data 
structure, seeking work where they can get 
it. There is a strong underlying sense in 
which the processes in a partitioned- 
network program are coupled, while those 
in a replicated-worker program are un¬ 
coupled. In the partitioned network pro¬ 
gram, each process must, in general, deal 
with its neighbor processes; in the 
replicated-worker program, workers ig¬ 
nore each other completely. The replicated 
worker model is interesting for a number 
of more specific reasons as well. 

1. It scales transparently. Once we have 
developed and debugged a program with a 
single worker process, our program will 
run in the same way, only faster, with ten 
parallel workers or a hundred. We need be 
only minimally aware of parallelism in de¬ 
veloping the program, and we can adjust 
the degree of parallelism in any given run 
to the available resources. 


2. It eliminates logically-pointless con¬ 
text switching. Each processor runs a 
single process. We add processes only 
when we add processors. The process-
management burden per node is exactly the
same when the program runs on one node 
as when it runs on a thousand. (This is not 
true, of course, in the network model. A 
network program always creates the same 
number of processes. If many processors 
are available, they spread out; if there are 
only a few, they pile up.) 

3. It balances load dynamically, by 
default. Each worker process repeatedly 
searches for a task to execute, executes it, 
and loops. Tasks are therefore divided at 
runtime among the available workers. 

It’s important to note that most of the 
programs we’ve experimented with are not 
pure replicated-worker examples; they in¬ 
volve some partitioning as well as some 
replication of duties. It’s also true that 
purely network-style programs may be 
written in Linda and may rely on distrib¬ 
uted data structures. Linda programs that 
tend towards the replicated style seem to 
be the most idiomatic and interesting, 
though. 

We illustrate with a simple example that 
doesn’t exercise the mechanism fully, but 
makes some basic points. We’ve tested 
several matrix multiplication programs 
using S/Net-Linda. One version consists 
of a setup-cleanup process and at least 
one, but ordinarily many, worker pro¬ 
cesses. Each worker is repeatedly assigned 
some element of the product matrix to 
compute; it computes this assigned ele¬ 
ment and is assigned another, until all 
elements of the product matrix have been 
filled in. If A and B are the matrices to be 
multiplied, then specifically 

1. The initialization process uses a suc¬ 
cession of out statements to dump A’s 
rows and B’s columns into TS. When these 
statements have completed, TS holds 
(“A”, 1, A’s-first-row)
(“A”, 2, A’s-second-row)
...
(“B”, 1, B’s-first-column)
(“B”, 2, B’s-second-column)
...

Indices are included as the second element
of each tuple so that worker processes,
using structured naming, can select the ith
row or jth column for reading. The
initializer then adds the tuple
(“Next”, 1) 

to TS and terminates. 1 indicates the next 
element to be computed. 

2. Each worker process repeatedly 
decides on an element to compute, then 
computes it. To select a next element, the 
worker removes the “Next” tuple from 
TS, determines from its second field the 
indices of the product element to be com¬ 
puted next, and reinserts “Next” with an 
incremented second field: 

in(“Next”, formal NextElem);
if (NextElem < dim * dim)
    out(“Next”, NextElem + 1);
i = (NextElem - 1) / dim + 1;
j = (NextElem - 1) % dim + 1;

The worker now proceeds to compute the
product element whose index is (i, j).
Note that if (i, j) is the last element of the
product matrix, the “Next” tuple is not 
reinserted. When the other workers at¬ 
tempt to remove it, they will block. A Lin¬ 
da program terminates when all processes 
have terminated or have blocked at in or 
read statements. 

To compute element (i, j) of the product,
the worker executes
read(“A”, i, formal row); 
read(“B”, j, formal col); 
out(“result”, i, j, DotProduct(row, col)); 

Thus each element of the product is packed 
in a separate tuple and dumped into TS. 
(Note that the first read statement picks 
out a tuple whose first element is “A” and 
second is the value of i; this tuple’s third
element is assigned to the formal row.) 

3. The cleanup process reels in the prod¬ 
uct-element tuples, installs them in the 
result matrix prod, and prints prod : 

for (row = 1; row <= NumRows; row++)
    for (col = 1; col <= NumCols; col++)
        in(“result”, row, col, formal prod[row][col]);
print prod;
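
The three steps can be put together in a runnable sketch. It is written in Python, with threads standing in for S/Net nodes and a toy TupleSpace class standing in for the Linda kernel; both are assumptions of the sketch, as are the names multiply and worker. Formals are written as Python types, the matrices are assumed square with integer entries, and workers that block on the missing "Next" tuple are simply abandoned as daemon threads.

```python
import threading

class TupleSpace:
    """Toy in-process tuple space; a template field that is a type is a formal."""
    def __init__(self):
        self._tuples, self._cond = [], threading.Condition()

    def out(self, *t):
        with self._cond:
            self._tuples.append(t)
            self._cond.notify_all()

    def _get(self, template, remove):
        with self._cond:
            while True:
                for k, t in enumerate(self._tuples):
                    if len(t) == len(template) and all(
                            isinstance(v, f) if isinstance(f, type) else v == f
                            for f, v in zip(template, t)):
                        if remove:
                            del self._tuples[k]
                        return t
                self._cond.wait()

    def in_(self, *template):
        return self._get(template, True)

    def read(self, *template):
        return self._get(template, False)

def worker(ts, dim):
    while True:
        _, nxt = ts.in_("Next", int)             # blocks for good once "Next" is gone
        if nxt < dim * dim:
            ts.out("Next", nxt + 1)
        i, j = (nxt - 1) // dim + 1, (nxt - 1) % dim + 1
        _, _, row = ts.read("A", i, tuple)       # read leaves the row for other workers
        _, _, col = ts.read("B", j, tuple)
        ts.out("result", i, j, sum(a * b for a, b in zip(row, col)))

def multiply(A, B, workers=4):
    dim = len(A)
    ts = TupleSpace()
    for k in range(dim):                         # step 1: rows of A, columns of B
        ts.out("A", k + 1, tuple(A[k]))
        ts.out("B", k + 1, tuple(r[k] for r in B))
    ts.out("Next", 1)
    for _ in range(workers):                     # step 2: identical workers
        threading.Thread(target=worker, args=(ts, dim), daemon=True).start()
    return [[ts.in_("result", i, j, int)[3]      # step 3: reel in the results
             for j in range(1, dim + 1)] for i in range(1, dim + 1)]

product = multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
```

Changing the workers argument alters only the degree of parallelism, not the program text, which is the scaling property claimed for the replicated-worker style.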

This simple program depends entirely 
on distributed data structures. The input 
matrices are distributed data structures; all 
worker processes may read them simulta¬ 
neously. In the manager-process model, 
processes would send read-requests to the 
appropriate manager and await its reply. 
The “Next” tuple is a distributed data 
structure; all worker processes share direct 
access to it. In the manager process model, 
again, worker processes would read and 
update the “Next” counter indirectly via a
manager. The product matrix is a distributed
data structure, which all workers participate
in building simultaneously.

Figure 7. An S/Net kernel. (a) shows out: broadcast, while (b) shows in: check locally.
(The inverse of this scheme is also possible.)

We discuss the performance of this pro¬ 
gram, and of another version that assigns 
coarser-grained tasks that compute an en¬ 
tire row rather than a single inner product, 
elsewhere. 3 Both versions show good 
speedup as we add processors up to the 
limited number available to us on our 
S/Net (currently eight). The version 
discussed above requires only two parallel 
workers and one control process to beat a 
conventional uniprocessor C program on 
32x32 matrices, and continues to show 
linear speedup as we add workers. The 
coarser-grained version, with its lower 
communication overhead, shows speedup 
close to ideal linear speedup of the 
uniprocessor C version: our figures show 
close to a progressive doubling, tripling, 
and so on of the C program’s speed as we 
add Linda workers. 

The matrix program displays the un¬ 
coupling and dynamic-scheduling proper¬ 
ties that we claimed above were important. 
Uncoupling: no worker deals directly with 
any other. Dynamic task scheduling: the 
matrix program assigns tasks to workers 
dynamically. But of course, in a problem 
as simple and regular as matrix multiplica¬ 
tion, dynamic scheduling isn’t important. 
We could just as well have assigned each of 
n workers 1/n of the product matrix to
compute. (It’s interesting to note, how¬ 
ever, that even with a problem as orderly 
as matrix multiplication, dynamic sched¬ 
uling might be the technique of choice if 
we were running on a nonhomogeneous 
network, on which processors vary in 
speed and in runtime loading. We’ve been 
studying just such a network—a collection 
of VAXes ranging from MicroVAX I’s to 
8600’s.) 

Dynamic scheduling becomes impor¬ 
tant when tasks require varying amounts 
of time to complete. In the general case, 
moreover, new tasks may be developed 
dynamically as old ones are processed. 
Linda techniques to deal with this general 
problem are based on a distributed data 
structure called a task bag. Workers 
repeatedly draw their next assignment 
from the task bag, carry out the specified 
assignment, and drop any new tasks 
generated in the process back into the task 
bag. The program completes when the bag 
is empty. The scheme is easily imple¬ 


mented in Linda. Elements of the bag will 
be tuples of the form 
(“Task”, task descriptor) 

Each worker executes the following loop: 
loop {
    /* withdraw a task from the bag: */
    in(“Task”, formal NextTask);
    process “NextTask”;
    for (each NewTask generated in the process)
        /* drop the new task into the bag: */
        out(“Task”, NewTask);
}

We’ve experimented with programs of 
this sort to perform LU decomposition 
with pivoting and to find paths through a 
graph, among others. Note that, if it were 
necessary to process tasks in a particular 
order rather than in arbitrary order, we 
would build a task queue instead of a task 
bag. The technique would involve num¬ 
bered tuples and structured naming. 
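The worker loop can be imitated on a single machine. The Python sketch below is a stand-in, not a Linda program: a thread-safe queue plays the part of the task bag, and the helper names (run_task_bag, split) are hypothetical. One detail the loop above leaves open is how workers learn that the bag is empty for good; this sketch assumes a count of unfinished tasks plus a shutdown marker.

```python
import queue
import threading

def run_task_bag(initial_tasks, process, n_workers=4):
    """process(task) returns an iterable of newly generated tasks."""
    bag = queue.Queue()                 # stands in for the ("Task", descriptor) tuples
    for task in initial_tasks:
        bag.put(task)                   # out("Task", task)
    completed = []
    lock = threading.Lock()

    def worker():
        while True:
            task = bag.get()            # in("Task", formal NextTask); blocks on an empty bag
            try:
                if task is None:        # shutdown marker, an addition of this sketch
                    return
                for new_task in process(task):
                    bag.put(new_task)   # out("Task", NewTask)
                with lock:
                    completed.append(task)
            finally:
                bag.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    bag.join()                          # every task put so far has been processed
    for _ in threads:
        bag.put(None)
    for th in threads:
        th.join()
    return completed

def split(task):
    """Example task: halve an interval until it has length 1."""
    lo, hi = task
    if hi - lo <= 1:
        return []
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)]
```

Calling run_task_bag([(0, 8)], split) processes the original interval and every interval generated from it, in whatever order the workers happen to draw them.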

The S/Net's Linda 
kernel 

Linda has often been regarded as posing 
a particularly difficult implementation 
problem. The difficulty lies in the fact 
that, as noted above, Linda supplies a 
form of logically-shared memory without 
assuming any physically-shared memory 
in the underlying hardware. The following
paragraphs summarize the way in which
we implemented Linda on the S/Net (the 
S/Net implementation is discussed in 
detail elsewhere 3 ); there are many other 
possible implementations as well. 

Our implementation buys speed at the 
expense of communication bandwidth 
and local memory. The reasonableness of 
this trade-off was our starting point. 
(Possible variants are more conservative 
with local memory.) 

Executing out(t) causes tuple t to be
broadcast to every node in the network;
every node stores a complete copy of TS.
Executing in(s) triggers a local search for a
matching t. If one is found, the local
kernel attempts to delete t network-wide
using a procedure we discuss below. If the
attempt succeeds, t is returned to the process
that executed in(). (The attempt fails
only if a process on some other node has
simultaneously attempted to delete t, and
has succeeded.) If the local search triggered
by in(s) turns up no matching tuple,
all newly-arriving tuples are checked until
a match occurs, at which point the matched
tuple is deleted and returned as before.
read() works in the same way as in(), except
that no tuple-deletion need be attempted.
As soon as a matching tuple is
found, it is immediately returned to the
reading process. See Figure 7.

The delete protocol must satisfy two re¬ 
quirements. First, all nodes must receive 
the “delete” message, and second, if
many processes attempt to delete simulta¬ 
neously, only one must succeed. The man¬ 
ner in which these requirements are met 
will depend, of course, on the available 
hardware. 

When some node fails to receive and 
buffer a broadcast message, a negative- 
acknowledgement signal is available on 
the S/Net bus. One possible delete pro¬ 
tocol has two parts: The sending kernel 
rebroadcasts repeatedly until the negative-
acknowledgement signal is not present. It
then awaits an “ok to delete t” message 
from the node on which t originated. In 
this protocol the kernel on the tuple’s 
origin node is responsible for allowing one 
process, and only one, to delete it. (We 
have implemented other protocols as well. 
Processes may use the bus as a semaphore 
to mediate multiple simultaneous deletes, 
for example, and avoid the use of a special 
“ok to delete” message.) 
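The replicated scheme and the origin-node delete rule can be modeled in a few lines. This Python sketch is a single-process simulation under stated assumptions, not the S/Net kernel: the bus is a plain list of nodes, broadcasts never fail, a value of None in a template stands for a formal, and in_ returns None where a real kernel would block. All class and method names are hypothetical.

```python
def matches(template, tup):
    """None in a template plays the role of a formal and matches any value."""
    return len(template) == len(tup) and all(
        f is None or f == v for f, v in zip(template, tup))

class Node:
    def __init__(self, node_id, network):
        self.id = node_id
        self.network = network      # list of all nodes, standing in for the bus
        self.store = {}             # tuple id -> tuple: this node's complete copy of TS
        self.given_away = set()     # ids of tuples that originated here and were deleted
        self.next_seq = 0

    def out(self, tup):
        tid = (self.id, self.next_seq)      # origin node plus a local sequence number
        self.next_seq += 1
        for node in self.network:           # broadcast: every node stores a copy
            node.store[tid] = tup

    def ok_to_delete(self, tid):
        """Runs on the tuple's origin node: grant the delete to one requester only."""
        if tid in self.given_away:
            return False
        self.given_away.add(tid)
        return True

    def in_(self, template):
        for tid, tup in list(self.store.items()):
            if matches(template, tup):
                origin = self.network[tid[0]]
                if origin.ok_to_delete(tid):
                    for node in self.network:      # network-wide delete
                        node.store.pop(tid, None)
                    return tup
        return None     # a real kernel would block and watch newly arriving tuples

    def read(self, template):
        for tup in self.store.values():
            if matches(template, tup):
                return tup
        return None

def make_network(n):
    network = []
    for i in range(n):
        network.append(Node(i, network))
    return network
```

Because the simulation is sequential, two deletes never actually race; the ok_to_delete check only shows where the arbitration would sit.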

Evidence suggests that a minimal out-in 
transaction, from kernel entry on the out 
side to kernel exit on the in side, takes 
about 1.4 ms. 

We are working on other implementations
as well. The VAX-network Linda
kernel (which was designed and is being
implemented by Jerry Leichter) uses a
technique that is in a sense the inverse of 
the existing S/Net scheme. We’re in the 
process of trying this new technique on the 
S/Net also. In the new protocol, out re¬ 
quires only a local install; in(s) causes 
template s to be broadcast to all nodes in 
the network. Whenever a node receives a 
template s, it checks s against all of its 
locally-stored tuples. If there is a match, it 
sends the matched tuple off to the tem¬ 
plate’s node. If not, it stores the template 
for x ticks (checking all tuples newly- 
generated within this period against it), 
then throws it out. If the template’s origin 
node hasn’t received a matching tuple 
after x ticks, it rebroadcasts the template. 
More than one node may respond with a 
matching tuple to a template broadcast; 
when a template broadcaster receives 
more than one tuple, it simply installs the 
extras alongside its locally-generated 
tuples and sends them onward when 
they’re needed. (In a more elaborate ver¬ 
sion, we can forestall the arrival of un¬ 
needed tuples by having potential senders 
monitor the bus, or by broadcasting an 
“I’ve got one, enough already” message 
at the appropriate point.) This scheme 
doesn’t require hardware support for 
reliable broadcast and it doesn’t require 
tuples to be replicated on each node, so 
per-node storage requirements are much 
lower. 
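The inverse scheme can be simulated in the same style. The Python sketch below is an assumption-laden model, not the VAX-network kernel: the template broadcast is a loop over all nodes, each node holding a match gives up one tuple, and the x-tick timeout and rebroadcast are replaced by simply returning None. The names are hypothetical.

```python
def matches(template, tup):
    """None in a template plays the role of a formal and matches any value."""
    return len(template) == len(tup) and all(
        f is None or f == v for f, v in zip(template, tup))

class Node:
    def __init__(self, network):
        self.network = network      # list of all nodes, standing in for the bus
        self.store = []             # only the tuples held on this node

    def out(self, tup):
        self.store.append(tup)      # local install; nothing is broadcast

    def take_match(self, template):
        """Remove and return one locally stored tuple matching the template."""
        for i, tup in enumerate(self.store):
            if matches(template, tup):
                return self.store.pop(i)
        return None

    def in_(self, template):
        replies = []
        for node in self.network:               # template broadcast to all nodes
            tup = node.take_match(template)
            if tup is not None:
                replies.append(tup)             # node sends its match to the requester
        if not replies:
            return None     # real kernel: template is held for x ticks, then rebroadcast
        self.store.extend(replies[1:])          # extras installed locally for later use
        return replies[0]

def make_network(n):
    network = []
    for _ in range(n):
        network.append(Node(network))
    return network
```

Note how the extras migrate: after one in_ that draws two replies, the second tuple lives on the requesting node rather than on the node that generated it.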

The Linda kernel for the Intel iPSC 
hypercube, designed and implemented by 
Rob Bjornson, relies on point-to-point 
rather than broadcast communication. 
His scheme implements tuple space as a 
hash table distributed throughout the net¬ 
work. Each tuple is hashed on out to a 
unique network node and is sent there for 
storage. Templates are hashed and stored 
in the same way. 
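A hashed tuple space can be sketched as follows. The text does not say which part of a tuple is hashed, so this Python model assumes the first field is always an actual and hashes on it; that choice, the CRC-based hash, and every name in the code are assumptions of the sketch rather than details of the iPSC kernel.

```python
import zlib

def matches(template, tup):
    """None in a template plays the role of a formal and matches any value."""
    return len(template) == len(tup) and all(
        f is None or f == v for f, v in zip(template, tup))

def home_node(key, n_nodes):
    """Map a tuple's first field to the one node responsible for it."""
    return zlib.crc32(repr(key).encode()) % n_nodes

class HashedTupleSpace:
    def __init__(self, n_nodes):
        self.nodes = [[] for _ in range(n_nodes)]   # one tuple list per node

    def out(self, tup):
        # Point-to-point send to the tuple's home node.
        self.nodes[home_node(tup[0], len(self.nodes))].append(tup)

    def in_(self, template):
        # The template hashes to the same node as any tuple it can match.
        bucket = self.nodes[home_node(template[0], len(self.nodes))]
        for i, tup in enumerate(bucket):
            if matches(template, tup):
                return bucket.pop(i)
        return None     # a real kernel would store the template there and wait
```

The design trades broadcast for a single message per operation, at the cost of requiring some field that tuple and template always share.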


Finally, several of us (Bjornson, Carri- 
ero, Leichter, and Gelernter) have begun, 
in conjunction with Scientific Computing 
Associates, to design and implement a 
Linda kernel for the Encore Multimax. 
Nodes on the Multimax have direct access 
to physically shared memory. The Multi¬ 
max Linda kernel should therefore be 
faster and simpler than the kernels 
described above, and in fact it is. The rela¬ 
tionship between Linda and shared- 
memory multiprocessors like the Multi¬ 
max is roughly similar to what holds be¬ 
tween block-structured languages and 
stack architectures. The architecture 
strongly supports the language; the lan¬ 
guage refines the power of the architecture 
and makes it accessible to programmers. 
Of course, for all its promise, the Encore 
doesn’t end our interest in networks like 
the S/Net. Shared memory seems ideal for 
small or medium-sized collections of pro¬ 
cessors. S/Net-like architectures, par¬ 
ticularly the Linda machine we describe 
below, may well scale upwards to enor¬ 
mous sizes. 

We have referred to Linda as a pro¬ 
gramming language, but it really isn’t. It is 
a new machine model, in the same sense in 
which dataflow or graph-reduction may 
be regarded as machine models as much as 
programming methodologies. The kernels 
described above are software realizations 
of a Linda machine, but Ahuja and 
Venkatesh Krishnaswamy of Yale are 
designing a hardware Linda machine as 
well, based on the S/Net. The heart of the 
Linda machine is a box to be interposed 
between each processor and the S/Net 
bus. The box implements the Linda com¬ 
munication kernel in hardware, turning an 
ordinary bus into a tuple space. The cur¬ 
rent box is designed for the S/Net ex¬ 
clusively, but we are interested in general 
versions that will connect arbitrary nodes
and communication media as well. In¬ 
stallation of either the software Linda 
kernel or the hardware Linda boxes has 
the effect of uniting many physically- 
disjoint nodes into one logically-shared 
space. 


Friends 

Linda may be regarded as machine lan¬ 
guage for the Linda machine. We can in 
fact compile higher-level parallel lan¬ 
guages into Linda. Higher-level languages 
may, for example, support shared vari¬ 
ables that are directly accessible to parallel 
processes. If v is a shared variable, the
compiler might translate
v := expr
to

in(“v”, formal v_value);
out(“v”, expr)
and
f(...v...)
to

read(“v”, formal v_value);
f(...v_value,...)

We can support data objects like streams 
on top of Linda in the same general way. 
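The translation of shared variables can be tried out against a small in-process tuple space. The Python sketch below is a stand-in for a Linda kernel, with None acting as a formal; the helper names assign, use, and update are hypothetical, and update (a read-modify-write built from in followed by out) is an extension of the two translations given above.

```python
import threading

class TupleSpace:
    """A small in-process stand-in for tuple space; None in a template is a formal."""
    def __init__(self):
        self._tuples = []
        self._cv = threading.Condition()

    def out(self, *tup):
        with self._cv:
            self._tuples.append(tup)
            self._cv.notify_all()

    def _wait_for(self, template):
        # Caller holds the lock; block until some tuple matches the template.
        while True:
            for tup in self._tuples:
                if len(tup) == len(template) and all(
                        f is None or f == v for f, v in zip(template, tup)):
                    return tup
            self._cv.wait()

    def in_(self, *template):
        with self._cv:
            tup = self._wait_for(template)
            self._tuples.remove(tup)
            return tup

    def read(self, *template):
        with self._cv:
            return self._wait_for(template)

def assign(ts, name, value):        # v := expr
    ts.in_(name, None)              # in("v", formal v_value)
    ts.out(name, value)             # out("v", expr)

def use(ts, name, f):               # f(...v...)
    _, v_value = ts.read(name, None)
    return f(v_value)

def update(ts, name, f):            # v := f(v); the tuple is absent while f runs
    _, v_value = ts.in_(name, None)
    ts.out(name, f(v_value))
```

Because the variable's tuple is withdrawn during an update, concurrent updaters queue up behind one another and no increment is lost.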

The higher-level parallel languages that 
interest us particularly are the so-called 
symmetric languages. Symmetric lan¬ 
guages are based on the proposition that, 
just as we can give names to arbitrary 
statement sequences and nest their execu¬ 
tion in arbitrary ways, we should be able to 
name arbitrary horizontal combinations, 
or environments, and nest them in ar¬ 
bitrary ways. 

Consider an arbitrary “execute sequen¬ 
tially” statement: 
s1; s2; ...; sn

We can represent the execution of this sequence
at runtime as in Figure 8. Each
box represents the execution of one state¬ 
ment in the sequence. The boxes are 
stacked on top of each other; the evalua¬ 
tions of successive elements occupy dis¬ 
joint intervals of time, but they may suc¬ 
cessively occupy the same space. (Thus if 
each si is a block that creates local 
variables, we can always reuse the previous 
block’s storage space for the next block’s 
variables, once evaluation of the previous 
block is complete.) Now suppose we trans¬ 
pose this structure around the time-space 


axis, as in Figure 9. Again, each box repre¬ 
sents the execution of a separate state¬ 
ment. In the resulting structure, which we 
refer to as an alpha, the evaluations of suc¬ 
cessive elements occupy disjoint regions of 
space and share one time; that is, they exe¬ 
cute concurrently. If we added alphas to a 
programming language and wrote them 
s1 & s2 & ... & sn,

the resulting statement calls for the execu¬ 
tion of all si in parallel. 
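The contrast between the two forms can be shown with threads. This Python sketch is only an analogy for the alpha, assuming each statement is a zero-argument callable; the names alpha and sequence are hypothetical and nothing here is Symmetric-language syntax.

```python
import threading

def sequence(*statements):
    """The ordinary 's1; s2; ...; sn' form: one statement after another."""
    for s in statements:
        s()

def alpha(*statements):
    """Run the given zero-argument callables concurrently; return when all are done."""
    threads = [threading.Thread(target=s) for s in statements]
    for t in threads:
        t.start()
    for t in threads:
        t.join()        # the alpha is complete only when every component is
```

The join loop mirrors the rule that execution of an alpha as a whole is not complete until each component has run to completion.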

Suppose we add one more element to 
the alpha’s definition. In most program¬ 
ming languages, a local variable’s scope is 
specified explicitly and its lifetime is in¬ 
ferred from its scope by the following sim¬ 
ple rule: A variable must live for at least as 
long as the statements that refer to it, so 
that they may be assured of finding it when 
they look. In symmetric languages we re¬ 
verse this rule and infer scope from life¬ 
time: If a variable is guaranteed to live for 
at least as long as a group of statements, 
then those statements may refer to it 
because they are assured of finding it when 
they look. Now consider the alpha: Execu¬ 
tion of an alpha as a whole can’t be com¬ 
plete until each of its components has exe¬ 
cuted to completion. (The same rule holds 
for the standard “execute sequentially” 
form.) Because alpha execution isn’t com¬ 
plete until every component has been fully 
evaluated, no box in the alpha representa¬ 
tion above will disappear until they all do. 
It follows that, if we store a named 
variable instead of an executable state¬ 
ment inside some box, then that variable 
should be accessible to statements in adja¬ 
cent boxes, because the variable and the 
statement live for the same interval of 
time. The statement is therefore assured of 
finding the variable when it looks for it. 
Hence, symmetric languages will use 
alphas to create blocks as well as to create 
parallel-execution streams. For example, 
the Pascal block 

var i: real; j: integer; begin ... end 
becomes 

i: real& j: integer& begin ... end 
in Symmetric Pascal. 

The alpha can in fact be used as a flexi¬ 
ble computational cupboard. We can 
store any assortment of named values and 
active processes in its slots. Symmetric 
languages use alphas to serve the purpose 
of a Pascal record, of a Simula class or 
Scheme closure, of a package or a module, 




and in fact of an entire program or en¬ 
vironment. All symmetric languages natu¬ 
rally encompass interpreted as well as 
compiled execution. A symmetric-language 
interpreter simply builds an alpha in¬ 
crementally, repeatedly tacking on new 
elements at the end. This incrementally- 
growing alpha is the interpreter’s environ¬ 
ment. Because the elements of an alpha 
may be evaluated in parallel, the sym¬ 
metric interpreter is a parallel interpreter: 
Each new expression the user enters is 
evaluated in a separate process, concur¬ 
rently with all previous expressions. The 
values returned by all these concurrent 
evaluations coalesce into a single shared 
naming environment. 

This is a mere sketch. Symmetric lan¬ 
guages are discussed in detail elsewhere. 4 
We are particularly interested in Sym¬ 
metric C and Symmetric Lisp; either may 
be implemented on top of the runtime en¬ 
vironment provided by the Linda kernel. 

The future 

We have many future plans. 

The semantics of a tuple space allow it, 
like a file, to exist independently of any 
particular process or program. A tuple 
space might in the abstract outlive many 
invocations of the same program. What 
we’d like, then, is for tuple spaces to be 
regarded as a special sort of file (or 
equivalently, for files to be special tuple 
spaces). We’d like to be able to keep tuple 
spaces along with files in hierarchical 
directories. With many tuple spaces to 
choose from, Linda processes must be 
given a way to indicate which one is the 
current one. Once some such mechanism 
has been provided, the availability of
multiple tuple spaces greatly expands the 
system’s capabilities. We can associate dif¬ 
ferent protection attributes with different 
tuple spaces, just as we do with different 
files. We can use tuple spaces to support 
communication between user and system 
processes by making tuple spaces available 
for out only, read only, and so on. We can 
also allow in operations that remove whole 
tuple spaces at one blow and outs that add 
whole tuple spaces. The design and im¬ 
plementation of such an extension is a goal 
for the immediate future. 

It’s clear that Linda can be an inter¬ 
preted command language as well as a 
compiled one. It would be useful to allow 
users to add, read, and remove tuples from 
active tuple spaces interactively. We’ve 
taken some preliminary steps towards im¬ 
plementing such a system. We’d like, too, 
to be able to encompass whole networks, 
even physically-dispersed ones, within 
Linda systems. We can then use Linda to 
write distributed network utilities like 
mailers and file systems. Our work on the 
VAX-network implementation is leading 
toward experiments of this sort. 

Finally, we imagine, as an object of
prime interest for the future, an 
enormous Linda machine highly op¬ 
timized to support Linda primitives. We 
don’t yet know how to build such a ma¬ 
chine, but it’s hard not to notice that very 
large networks with small diameters might 
be constructed out of multidimensional 
grids of S/Net-like buses. Having built 
such a machine, we imagine tuple space 
itself as the machine’s main memory. 
(Outside of tuple space, only registers and 
local caches exist.) As the Linda primitives 
become faster and more efficient, such an 
architecture looks more and more like a 
sort of dataflow machine, but with a 
crucial difference. As in a dataflow 
machine, we can create task templates 
(stored in tuples), update them with new 
values as they become available, and mark 
them “enabled” when all values are filled 
in. General-purpose evaluator processes, 
much like the replicated workers discussed 
above, use in to pick out task tuples 
marked “enabled.” But unlike the token 
space of a dataflow machine, a Linda 
machine’s tuple space can store data struc¬ 
tures as well as task descriptors. Processes 
are free to build whatever (distributed) 


data structures they want, and manipulate 
and side-effect them as they choose. We 
might even use such a Linda machine to 
store large databases operated upon in 
parallel. 

Such work is for the future. We still lack 
a polished Linda implementation on any 
machine. We hope to have one soon. And 
clearly we can learn a great deal by continuing
to refine and to experiment with Linda
kernels for present-generation architecture.
This is what we plan to do.


References 

1. J. T. Deutsch and A. R. Newton, 
“MSplice: A Multiprocessor-based 
Circuit Simulator,” Proc. 1984 Int’l 
Conf. Parallel Processing, Aug. 1984, pp. 
207-214. 

2. G. F. Pfister et al., “The IBM Research 
Parallel Processor (RP3): Introduction 
and Architecture,” Proc. 1985 Int’l Conf.
Parallel Processing, Aug. 1985. 



Sudhir Ahuja obtained his MS and PhD in elec¬ 
trical engineering from Rice University in 1974 
and 1977, respectively. He has been with AT&T 
Bell Laboratories, Holmdel, NJ since 1977. He 
is currently the head of the System Architec¬ 
tures Research Department. His earlier work 
involved associative memories, pipelining, and 
parallel processing. He has been involved in the 
design and implementation of an associative 
processor, high-speed buses, and multiproces¬ 
sor systems. His current interests are in the field 
of multiprocessor architectures, concurrent 
programming, local networking, and the use of 
VLSI to implement specialized processor archi- 


Readers may write to Gelernter at the Dept. of
of Computer Science, PO Box 2158, Yale Sta¬ 
tion, New Haven, CT 06520-2158. 


Acknowledgments 

Rob Bjornson, Venkatesh Krishna- 
swamy, and Jerry Leichter are our collab¬ 
orators in the Yale Linda group. Thanks 
also to Erik DeBenedictis, Robert Gaglia- 
nello, Howard Katseff, and Thomas Lon¬ 
don of AT&T Bell Labs. 


3. N. Carriero and D. Gelernter, “The 
S/Net’s Linda Kernel,” Proc. Symp. 
Operating System Principles, Dec. 1985, 
and ACM TOCS, May 1986. 

4. D. Gelernter, “Symmetric Programming 
Languages,” Yale Univ. Dept. Comp. 
Sci. tech. report yaleu/dcs/rr#253, Dec.
1984. 



Nicholas Carriero is a graduate student in the 
Yale University Department of Computer 
Science. He received a BS from Brown in 1980 
and an MS in computer science from SUNY at 
Stony Brook in 1983. Distributed programming 
languages and operating systems are his re¬ 
search interests. 


David Gelernter’s biography and photo appear 
following the Guest Editor’s Introduction, on 
page 16. 















Domesticating Parallelism 



Parallel Symbolic 
Computing 


Robert H. Halstead, Jr.
Massachusetts Institute of Technology 



Futures find
parallelism in symbolic
programs by allowing

Programs differ from one another in
many dimensions. In one such di¬ 
mension, programs can be laid out 
along a spectrum with predominantly 
symbolic programs at one end and pre¬ 
dominantly numerical programs at the 
other. The differences between numerical 
and symbolic programs suggest different 
approaches to parallel processing. This 
article explores the problems and oppor¬ 
tunities of parallel symbolic computing 
and describes the language Multilisp, used 
at M.I.T. for experiments in parallel 
symbolic programming. 

Numerical versus 
symbolic computation 

Much of the attention focused on par¬ 
allel processing has concerned numerical 
applications. High-performance numer¬ 
ical computers have been designed using 
varying degrees of concurrency. Pro¬ 
gramming tools for these computers 
range from compilers that automatically 
identify concurrency in Fortran pro¬ 
grams to languages featuring explicit 
parallelism following a communicating- 
sequential-processes 1 model. Numerical 
computation emphasizes arithmetic. The 
principal function of a numerical program 
may be described as delivering numbers to 
an arithmetic unit to calculate a result. 
Numerical programs generally have a 
relatively data-independent flow of con¬ 
trol. Within broad limits, the same se¬ 
quence of calculations will be performed 
no matter what the operand values are. In¬ 
ner loops of numerical programs may con¬ 
tain conditionals, and overall control of a 
program generally includes tests of convergence
criteria and such, but most
numerical programs have a relatively pre¬ 
dictable control sequence when compared 
with the majority of symbolic programs. 
Matrices and vectors are common data 
structures in numerical programs, a fact ex¬ 
ploited by single-instruction stream, multi¬ 
ple data stream (SIMD) techniques in many 
numerically-oriented supercomputers. 

In contrast, symbolic computation em¬ 
phasizes rearrangement of data. Partly 
because of this, heavily symbolic pro¬ 
grams are more likely to be written in a 
language such as Lisp 2 or Smalltalk than 
in Fortran. The principal function of a 
symbolic program may be broadly stated 
as the reorganization of a set of data so 
that the relevant information in it is more 
useful or easier to extract. Examples of 
primarily symbolic algorithms include 
sorting, compiling, database manage¬ 
ment, symbolic algebra, expert systems, 
and other artificial intelligence applica¬ 
tions. The sequence of operations in sym¬ 
bolic programs is often highly data depen¬ 
dent and less amenable to compile-time 
analysis than in numerical computation. 
Moreover, there does not appear to be any 
simple operation style, comparable to vec¬ 
tor operations in numerical programs, 
that can easily be exploited to increase per¬ 
formance with a SIMD type of architec¬ 
ture. Some operations, such as procedure 
calling, pointer following, and even tree 
search, occur frequently in symbolic pro¬ 
grams, but it is not obvious how SIMD 
parallelism can help efficiently with these. 

The structure of symbolic computations 
generally seems to lend itself less well to 
analysis of loops (the major focus in 







parallelizing numerical computation), 
favoring recursions on composite data 
structures such as trees, lists, and sets as 
the major source of concurrency. Pro¬ 
gramming languages such as QLisp 3 and 
Multilisp 4,5 include constructs to take advantage
of these sources of concurrency.

Overview of Multilisp 

Multilisp is a version of the Lisp-like 
programming language Scheme 2 ex¬ 
tended to allow the programmer to specify 
concurrent execution. Multilisp shares 
with Scheme two properties that distin¬ 
guish them from the more common mem¬ 
bers of the Lisp family. The first is ex¬ 
clusive reliance on lexical scoping, which 
promotes modularity. The second is 
“first-class citizenship” for procedures: 
Procedures in Scheme and Multilisp may 
be passed freely as arguments, returned as 
values of other procedures, stored in data 
structures, and treated in the same way as 
any other kind of value. 

Multilisp includes the usual Lisp side- 
effect primitives for altering data struc¬ 
tures and changing the values of variables. 
Therefore, control sequencing beyond 
that imposed by explicit data dependencies 
may be required in order to assure deter¬ 
minate execution. In this respect, Multilisp 
parts company with many concurrent Lisp 
languages, 6,7 which include only a side-
effect-free subset of Lisp. 

The default in Multilisp is sequential 
execution. This allows Lisp programs or 
subprograms written without attention to 
parallelism to run, albeit without using the 


potential concurrency of the target ma¬ 
chine. Concurrency can be introduced 
into a Multilisp program by means of the 
future construct. The form (future X) im¬ 
mediately returns a future 8 for the value 
of X and creates a task to concurrently 
evaluate X, allowing concurrency between 
the computation of a value and the use of 
that value. When the evaluation of X 
yields a value, that value replaces the 
future. We say that the future resolves to 
the value. Any task that needs to know a 
future’s value will be suspended until the 
future is resolved. 

A task T examines, or touches, a future 
when it performs an operation that causes 
T to be suspended if the future is not yet 
resolved. Most operations, such as arith¬ 
metic, comparison, and type checking, 
touch their operands. (Any operation that 
is strict in an operand touches that oper¬ 
and.) However, simple transmission of a 
value from one place to another, such as 
by assignment, passing as a parameter to a 
procedure, returning as a result from a 
procedure, or building the value into a 
data structure, does not touch the value. 
Thus, many things can be done with a 
future without waiting for its value. 
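The distinction between transmitting a future and touching it can be imitated with Python's standard thread pool. This is a sketch by analogy, not Multilisp: in Multilisp touching is implicit in strict operations, whereas here it is an explicit call. The names future, touch, slow_square, and demo are hypothetical.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def future(thunk):
    """Start evaluating thunk() in another task and return a placeholder at once."""
    return _pool.submit(thunk)

def touch(x):
    """Wait for x if it is an unresolved future; otherwise return it unchanged."""
    while isinstance(x, Future):
        x = x.result()
    return x

def slow_square(n):
    time.sleep(0.05)
    return n * n

def demo():
    # Building a data structure and passing it along do not touch the futures.
    pair = [future(lambda: slow_square(3)), future(lambda: slow_square(4))]
    passed_along = list(reversed(pair))
    # Addition is strict in both operands, so each operand is touched here.
    return touch(passed_along[0]) + touch(passed_along[1])
```

Both squarings run concurrently with the list manipulation; only the final addition waits.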

In Multilisp, future is the only primitive 
for creating a task. There is a one-to-one 
correspondence between tasks and the 
futures whose values they were created to 
compute. Every task ends by resolving its 
associated future to some value. 

future is related to the idea of lazy 
evaluation, often used in designs for 
graph-reduction architectures. 7,9 In lazy 
evaluation, an expression is not evaluated 
until its value is demanded by some other 


part of a computation. When an expres¬ 
sion is encountered in a program, it is not 
evaluated immediately. Instead, a suspen¬ 
sion is created and returned, and evalua¬ 
tion of the expression is delayed until the 
suspension is touched (in the Multilisp 
sense). A suspension is much like a future; 
the only difference between future and 
lazy evaluation is that future does not wait 
for the suspension to be touched before 
beginning evaluation of the expression. 
Multilisp has a delay primitive that im¬ 
plements lazy evaluation exactly (it returns 
a future and does not begin evaluation of 
the expression until the future is touched), 
but delay by itself does not express any 
concurrency. 
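The behavior of delay can be captured in a small class. This Python sketch is an illustration under the assumption that touching is an explicit method call; the class name Delay and its fields are hypothetical, not part of Multilisp.

```python
import threading

class Delay:
    """Evaluate thunk() the first time the value is touched, and only once."""
    def __init__(self, thunk):
        self._thunk = thunk
        self._lock = threading.Lock()
        self._resolved = False
        self._value = None

    def touch(self):
        with self._lock:
            if not self._resolved:
                self._value = self._thunk()   # evaluation begins only now
                self._resolved = True
                self._thunk = None            # drop the closure once it has run
            return self._value
```

Nothing runs until touch is called, and the evaluation happens in the toucher's own thread, which is why delay alone expresses no concurrency.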

Although future induces some patterns 
reminiscent of those found in graph- 
reduction architectures, in other ways 
future creates a style of computation much 
like that found in data flow architectures. 10 
Every task suspended waiting for a future 
to resolve is like a data flow operator 
waiting for an operand to arrive. As in the 
case of data flow, each such task becomes 
eligible to proceed as soon as all its 
operands become available. When a task 
proceeds, it will eventually resolve another 
future, reactivating other suspended tasks 
in a pattern very reminiscent of the flow of 
data tokens in a data flow graph. Futures 
thus offer access to an interesting mixture 
of styles of parallel computation. 

An example program in 
Multilisp 

To get an idea of what programming 
with futures is like, consider an example 
Multilisp program that manipulates sets
represented as binary trees. The precise 
nature of the elements of these sets is not 
important—they could be integers, 
ordered pairs, character strings, or what¬ 
ever—but assume that they are totally 
ordered by the Lisp predicate elt<, so 
(elt< A B) returns true if and only if the 
element A precedes the element B in the 
total order. The existence of the total 
order allows us to arrange the N elements
of a set for lookup in O(log N) time by any
of a variety of well known techniques. 
Possible uses of such a set are to collect in¬ 
tegers or character strings with certain 
properties in common, or to record pairs 
of values from the domain and range of 
some function. 

Our example program uses binary trees 
built out of nodes as suggested by Figure 1. 
Each leaf node of a tree is an actual set ele¬ 
ment; each interior node is a triple, as 
shown in Figure 1a. A Lisp function leaf?
distinguishes between the two types of
nodes: (leaf? X) returns true if X is a leaf
node and false if X is an interior node.
Each interior node has left and right
children that are other nodes, plus a
discriminant equal to the largest element
stored in the left subtree of that node, as
shown in Figure 1b. A Lisp function
(make-node L D R) makes and returns a
new interior node whose left child is L,
whose discriminant is D, and whose right
child is R. Given an interior node N,
(left-child N), (discriminant N), and
(right-child N) return, respectively, the left child,
discriminant, and right child of N.

A Multilisp procedure to insert an ele¬ 
ment elt into a tree tree is shown in Figure 
2. This is a nondestructive insert; it copies 
the tree nodes to be modified and returns a 
new tree rather than performing side ef¬ 
fects on existing nodes. Except for its two 
uses of future, Figure 2 is the straightfor¬ 
ward Lisp procedure for insertion into this 
kind of tree. The case of inserting into an 
initially empty tree needs special treat¬ 
ment. In this case, insert just returns elt (a 
single leaf node) as the resulting tree. If 
tree is not empty, then it may be a leaf or 
an interior node. If it is a leaf, insert 
returns an interior node with elt and tree as 
children in the proper order. If tree is an 
interior node, insert determines whether 
elt belongs in the left or right subtree of 
tree and returns a new interior node with 
the same discriminant and suitable left and 
right children. 


(defun insert (elt tree)
  (if (empty-tree? tree)
      elt
      (if (leaf? tree)
          (if (elt< tree elt)
              (make-node tree tree elt)
              (make-node elt elt tree))
          (if (elt< (discriminant tree) elt)
              (make-node (left-child tree)
                         (discriminant tree)
                         (future (insert elt (right-child tree))))
              (make-node (future (insert elt (left-child tree)))
                         (discriminant tree)
                         (right-child tree))))))


Figure 2. Insert routine for search trees, using future.
In addition to the procedures discussed in the text, this
program uses two standard Lisp special forms.
(defun f (v1 v2 ...) body) defines a procedure f whose
formal parameters are v1, v2, ..., and whose value is
the value of the expression body. (if X Y Z) returns the
value of Y if X evaluates to true; otherwise the value of
Z is returned.


The use of future in Figure 2 allows in¬ 
sert to return even before the insertion has 
completed. If future were not used, then 
an insert applied to an interior node would 
not return until its recursive call to insert 
had returned. Thus the new tree would be 
constructed in a bottom-up order and no 
result would be returned until the new tree 
had been completely constructed. Using 
future, however, insert can construct a 
new node and return it without waiting for 
completion of recursive calls to insert. If 
tree is an interior node, insert makes a new 
node that points to a future that will 
resolve to the value of the recursive call to 
insert. Consequently, the result of an in¬ 
sertion develops in more of a top-down 
fashion, as shown in Figure 3. 
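The top-down behavior can be reproduced outside Multilisp. The following Python sketch ports Figure 2 under several assumptions: each future gets its own thread, touching is an explicit call, the empty tree is None, a leaf is any value that is not a Node, and elements are compared with Python's < in place of elt<. All of these names and representation choices are hypothetical.

```python
import threading
from concurrent.futures import Future

def future(thunk):
    """Evaluate thunk() in a new thread; return a placeholder immediately."""
    f = Future()
    def run():
        try:
            f.set_result(thunk())
        except BaseException as exc:
            f.set_exception(exc)
    threading.Thread(target=run, daemon=True).start()
    return f

def touch(x):
    """Wait until x is an actual value rather than an unresolved future."""
    while isinstance(x, Future):
        x = x.result()
    return x

class Node:
    def __init__(self, left, disc, right):
        self.left, self.disc, self.right = left, disc, right

def insert(elt, tree):
    tree = touch(tree)                  # the tests below are strict in tree
    if tree is None:                    # empty tree
        return elt
    if not isinstance(tree, Node):      # leaf
        return Node(tree, tree, elt) if tree < elt else Node(elt, elt, tree)
    if tree.disc < elt:                 # new node returned before the recursion ends
        return Node(tree.left, tree.disc, future(lambda: insert(elt, tree.right)))
    return Node(future(lambda: insert(elt, tree.left)), tree.disc, tree.right)

def elements(tree):
    """In-order list of the set's elements; touches every future in the tree."""
    tree = touch(tree)
    if tree is None:
        return []
    if not isinstance(tree, Node):
        return [tree]
    return elements(tree.left) + elements(tree.right)
```

Each call to insert returns after examining a single node; the rest of the path is filled in by the thread behind the future stored in the new node.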

In order to insert three new elements A,
B, and C into some tree T, we could write
(insert C (insert B (insert A T)))

A naive analysis might conclude that no
concurrency is available in this expression,
due to data dependencies (the insertion of
B requires the result of inserting A, and so
on). Yet with futures we find that some
concurrency is available. For example, if T
were the tree of Figure 1b and A, B, and C
were 7, 4, and 29, respectively, then the insert
of B could begin as soon as the insert
of A returns, and the insert of C need only
await the return of the insert of B and the
determination of the first future created
during A's insertion. The remaining work
for the three insertions can then proceed in
parallel.

The fallacy in the naive analysis is in 
treating structured values such as binary 
trees as indivisible units. In fact, many (if 
not most) operations on structured values 


require only partial information about 
their operands. Futures give us a way to 
represent partially computed values, so 
they can be released for use while they are 
still being computed. As illustrated by the 
example, this can expose concurrency not 
easily accessible using conventional fork- 
join control structures. This is especially 
significant in symbolic computing, where 
operations on structured data are the 
norm, and where opportunities to use the 
well-known loop and flow analysis tech¬ 
niques are often much more limited than 
in the case of numerical computing. 

It is of course possible to select values 
for A, B, and C above such that futures 
will yield relatively little concurrency (such 
as A = 7, B = 9, and C=10). In this case 
the lack of concurrency results from real 
data dependencies: all three insertions are 
operating in the same region of the tree. 
Even in this case there will be some concur¬ 
rency as the insertions of A, B, and C 
follow each other down the tree, but each 
insertion will be prevented from complete¬ 
ly finishing until the previous one has 
finished most of its work. The future construct cannot 
remove actual data dependencies, but can 
remove apparent dependencies by allow¬ 
ing structured values to be computed 
piecemeal. 

This example illustrates the character of 
futures in a relatively simple setting, rather 
than illustrating the best in parallel tree 
management schemes. At the superficial 
level, we can certainly extend the program 
in Figure 2 to cope with insertion of an 
item already in the tree. We can also add 
the delete and lookup routines desired in 
many applications. 

Figure 3. Top-down behavior of insert 
using future. (a) through (d) depict successive 
stages resulting from inserting 7 
into the tree shown in Figure 1b. (a) 
shows the initial value returned, while (d) 
shows the final value after all futures 
have been resolved. The cloud-like 
shapes represent unresolved futures. 


More serious for some applications than 
the lack of these features is the fact that the 
program does not necessarily produce 
balanced search trees. If insertions are per¬ 
formed in some unfortunate order (such as 
strictly increasing), the result of inserting 
K items can be a tree of depth K that takes 
O(K^2) time to build. 
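
The quadratic cost can be checked by counting key comparisons. The sketch below is a plain sequential insertion into an unbalanced tree, written in Python for illustration; the function name and the list-based node layout are assumptions. For K strictly increasing keys, the i-th insertion walks past all i earlier nodes, so the total is 0 + 1 + ... + (K-1) = K(K-1)/2.

```python
def comparisons_to_build(keys):
    """Insert keys one by one into an unbalanced binary search tree and
    return the total number of key comparisons performed."""
    root = None                      # each node is [key, left, right]
    total = 0
    for k in keys:
        if root is None:
            root = [k, None, None]
            continue
        node = root
        while True:
            total += 1
            side = 1 if k < node[0] else 2
            if node[side] is None:
                node[side] = [k, None, None]
                break
            node = node[side]
    return total

K = 100
worst = comparisons_to_build(range(K))          # strictly increasing keys
balanced = comparisons_to_build([4, 2, 6, 1, 3, 5, 7])
```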

Various schemes exist for building bal¬ 
anced search trees, such as 2-3 trees and 
AVL trees. 11 These schemes can be 
adapted using future to yield parallelism, 
but the results can be disappointing. The 
reasons are instructive. Balanced tree 
schemes generally try to maintain rough 
equality at each node N between the 
number of nodes in the subtrees headed by 
the children of N. The common sequential 
algorithms for insertion into balanced 
trees first find the place where the new leaf 
node will be added, then work back up the 
tree toward the root, rearranging the 
structure as needed to ensure the proper 
balance. Thus, the final shape of the tree, 
even near the root, may be determined 
only after the location of the new leaf node 
has been determined. 

Unfortunately, the concurrency in the 


program in Figure 2 comes from releasing 
information about the structure of the 
resulting tree even before an insertion has 
progressed all the way to a leaf node. 
Every recursive invocation of insert ends 
by specifying the left and right children of 
some newly created node. One of the 
children may be specified as a future, so its 
value may not be known, but its identity is 
known. In the case of both 2-3 trees and 
AVL trees, at certain points it can be seen 
that any tree reorganization resulting from 
an insertion cannot propagate above that 
point. But at other points, the identities of 
the child nodes cannot be fixed without 
additional information generated as the 
insertion progresses. At such points, con¬ 
struction of a new node must be delayed 
until it becomes clear what its children 
should be. This delays the top-down 
evolution suggested by Figure 3 and 
therefore reduces the opportunities for 
parallelism. Balanced-tree schemes are not 
necessarily unsuitable for parallel execu¬ 
tion, but we need different algorithms that 
operate in a more top-down manner, per¬ 
haps by accepting a more relaxed standard 
of balance for search trees. 


Parallel programming in 
the large 

Although our tree-insertion program is 
a simple example, it shows how futures 
can be used to expose concurrency in deal¬ 
ing with composite data structures by pro¬ 
viding a representation for partially com¬ 
puted data. It also shows the importance 
of algorithms able to release partial infor¬ 
mation about their results as soon as possi¬ 
ble. However, applications for parallel 
computing are generally large programs, 
not 20-line programs such as Quicksort or 
tree insertion. Therefore, a useful system 
for applying parallel computation to real 
problems must support powerful ways of 
combining pieces together into large pro¬ 
grams, not just techniques for making 
small subprograms use concurrency. Con¬ 
structs such as future, which help us easily 
glue programs together in concurrent 
ways, are only part of what we need. We 
also need adequate control over the alloca¬ 
tion of resources (notably processors) to 
the execution of various parts of a pro¬ 
gram. In effect, we need control over the 
focus of attention of a parallel computer 
as it executes a program. 

Other aids for the development and 
structuring of large programs include sup¬ 
port for debugging, exception handling, 
data abstractions, and atomic modifica¬ 
tion of mutable data objects. Except for 
the last item, all of these are also important 
in sequential programming, but parallel 
programming brings new dimensions to 
them. 

Scheduling. Algorithms may have op¬ 
portunities for concurrency at any of sev¬ 
eral levels of granularity, ranging from 
short sequences of primitive operations to 
large program modules. These opportuni¬ 
ties are multiplicative. If the application of 
medium- or fine-grain parallelism within a 
module is sufficient to occupy m proces¬ 
sors and n of these modules can be exe¬ 
cuted in parallel, then mn processors can 
be used efficiently to execute the program 
as a whole (unless contention for shared 
resources imposes a smaller limit). Thus, 
we should exploit opportunities for con¬ 
currency at all levels if we desire execution 
on a highly parallel machine. 


Concurrency, however, arises from a 
variety of program structures. One way to 
write a concurrent program is to start with 
a suitably chosen sequential program and 
then relax some of the precedence con¬ 
straints in that program to produce oppor¬ 
tunities for executing some operations 
concurrently. This mandatory work style 
of parallel programming is supported by 
future, as well as by the fork-join con¬ 
structs of other languages. A concurrent 
program written in the mandatory work 
style executes precisely the same set of op¬ 
erations as its sequential counterpart; only 
the scheduling of operations is different. 
Language constructs, such as future and 
fork-join, may differ in their effective¬ 
ness at relaxing precedence constraints 
and may therefore be more or less useful 
in support of the mandatory work style, 
but the basic equivalence between the sets 
of operations performed by sequential 
and mandatory work parallel programs 
remains. 

The mandatory work style contrasts 
with the speculative approach, where par¬ 
allelism is obtained by eagerly spawning 
tasks before it is certain that their values 


will be needed. A characteristic particular¬ 
ly prevalent in artificial intelligence pro¬ 
grams, but also found elsewhere, is the 
existence of multiple techniques to solve a 
class of problems. For any particular 
problem, some of the techniques may 
work very quickly, while others may fail 
altogether. We therefore desire the ability 
to start using several techniques in parallel 
and to terminate execution of the others 
when one of the techniques produces an 
answer. 

There are many opportunities for spec¬ 
ulative parallelism outside the domain of 
artificial intelligence, especially in search¬ 
ing problems. An example is the use of 
branch-and-bound techniques for prob¬ 
lems such as the traveling salesman prob¬ 
lem. A simple example of speculative par¬ 
allelism from yet another domain is the 
Fermat test to see if a number is prime. 
The Fermat test is based on the observation 
that, if p is prime, then a^(p-1) mod p 
= 1 whenever 0 < a < p. Furthermore, if 
p is not prime, then a^(p-1) mod p has a certain 
probability of not being 1 when a is 
chosen randomly from 0 < a < p. The 
primality of a number p can be tested by 
randomly selecting several values a from 0 
< a < p and evaluating a^(p-1) mod p for 
each a. If all evaluations yield 1, then with 
high probability p is prime. Otherwise, p is 
certainly not prime. (This test has a high 
probability of being fooled for certain 
pathological non-prime numbers p. 
Other, more sophisticated tests on a and p 
yield more reliable results, but the overall 
organization, and probabilistic nature, of 
the algorithm remain the same.) 

A sequential program for the Fermat 
test would select one a after another until 
either a sufficient number of a’s have been 
tested or a^(p-1) mod p ≠ 1 for one of the 
choices of a. One strategy for using paral¬ 
lelism in the Fermat test would be to test 
several different a’s concurrently. In the 
case where p is prime, this would make the 
test go much faster, but if p is not prime, it 
might be that a^(p-1) mod p ≠ 1 for the very 
first a. Then the work done on the other 
a’s would be wasted and any remaining 
work associated with testing the primality 
of p should be cancelled. Therefore, if 
there is other mandatory work to do, the 
successive a’s should be tested sequentially, 
but if processing capacity would otherwise 
go idle, then the a’s might as well be 
tested in parallel. 
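
The two strategies can be compared in a short Python sketch. Everything here is illustrative: the function names are invented, a thread pool stands in for a parallel machine (Python threads will not actually speed up this arithmetic), and cancel() can only withdraw tests that have not yet started. The point is the structure: the speculative version launches all tests eagerly and abandons the remainder once one fails.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def passes(a, p):
    """True if a^(p-1) mod p = 1, as required when p is prime."""
    return pow(a, p - 1, p) == 1

def fermat_sequential(p, bases):
    """Mandatory-work style: test one a after another, stop at a failure."""
    for a in bases:
        if not passes(a, p):
            return False             # p is certainly not prime
    return True                      # p is prime with high probability

def fermat_speculative(p, bases, workers=4):
    """Speculative style: start every test eagerly, cancel on first failure."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = [pool.submit(passes, a, p) for a in bases]
        for done in as_completed(pending):
            if not done.result():
                for f in pending:
                    f.cancel()       # withdraws only tests not yet started
                return False
    return True

def random_bases(p, count, seed=None):
    """Choose count values a with 0 < a < p."""
    rng = random.Random(seed)
    return [rng.randrange(1, p) for _ in range(count)]
```

For p = 91 = 7 x 13 the very first base 2 already fails, so any other tests started speculatively are wasted work. The composite 561 is one of the pathological numbers mentioned above: base 2 does not expose it.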


Compared to speculative parallelism, 
mandatory work parallelism is especially 
nice because scheduling is less critical and, 
except for process management overhead, 
no extra operations are performed during 
parallel execution. Assuming the original 
sequential algorithm is efficient, the man¬ 
datory work approach represents a kind of 
lower bound. It may be possible to in¬ 
crease concurrency beyond that available 
in a mandatory work program by adding 
speculative operations, but these opera¬ 
tions represent an overhead justified only 
if the increase in parallelism outweighs the 
extra work done. 

Scheduling of mandatory operations is 
not very critical because all mandatory 
operations must be done eventually, so 
any mandatory operation ready to be per¬ 
formed may be executed with assurance 
that it will not be wasted work. In fact, the 
Multilisp implementation uses an unfair 
scheduler, which is perfectly legal in the 
case of mandatory work and helps solve 
some resource allocation problems 4 very 
hard to solve using a fair scheduler. (For 
utmost efficiency, mandatory operations 
on the critical path of a computation 
should be treated as more mandatory than 
others and always included in the set that 
gets scheduled. In practice, the precedence 
graphs of most parallel programs seem to 
have a “bushy” enough structure that this 
is not a major concern.) 

Scheduling is much more critical in the 
presence of speculative parallelism. Usual¬ 
ly, some speculative tasks have a higher 
potential payoff than others. Low-payoff 
speculative tasks should not be executed in 
preference to high-payoff speculative 
tasks or mandatory tasks. The only time to 
execute speculative tasks is when process¬ 


ing resources would otherwise go idle; they 
should not take resources away from more 
important tasks. 

To exploit concurrency at all levels and 
from all sources, both mandatory work 
and speculative parallelism should be 
used. Therefore, both need to be sup¬ 
ported by parallel programming lan¬ 
guages. However, tools for expressing 
speculative parallelism cannot replace 
good constructs for mandatory work par¬ 
allelism. The latter is higher quality 
parallelism and should always be exploited 
as fully as possible before resorting to 
speculative parallelism. 

Multilisp’s future construct is a fairly ef¬ 
fective tool for exposing mandatory work 
parallelism, but future does not give the 
information needed to properly schedule 
speculative tasks. One idea on how to do 
this is to associate a sponsor 12 with each 
task. The sponsor answers questions from 
the scheduler regarding the importance of 
its task relative to others. Although spon¬ 
sors are a mechanism that can be used for 
scheduling speculative tasks, the policy 
implemented by the sponsors remains an 
issue. In some cases, the scheduling of 
tasks could be dictated by associating 
numerical priorities with the tasks, but the 
general question of what tools to give the 
programmer for use in specifying the 
scheduling of speculative tasks remains an 
interesting question for research. 
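
As one illustration of the numerical-priority idea, here is a small Python sketch of a task queue that always prefers mandatory work and orders speculative tasks by estimated payoff. It is not the Multilisp scheduler and does not model sponsors; the Scheduler class, the payoff parameter, and the single-threaded run loop are assumptions made for the example.

```python
import heapq
import itertools

MANDATORY = 0        # always outranks speculative work
SPECULATIVE = 1

class Scheduler:
    def __init__(self):
        self._queue = []
        self._seq = itertools.count()    # tie-breaker, keeps FIFO order

    def spawn(self, task, kind=MANDATORY, payoff=0.0):
        # Smaller tuples pop first: mandatory before speculative,
        # then higher payoff before lower payoff.
        heapq.heappush(self._queue, (kind, -payoff, next(self._seq), task))

    def run(self):
        """Run every queued task in priority order; return their results."""
        results = []
        while self._queue:
            _, _, _, task = heapq.heappop(self._queue)
            results.append(task())
        return results

sched = Scheduler()
sched.spawn(lambda: "guess-low", kind=SPECULATIVE, payoff=0.1)
sched.spawn(lambda: "guess-high", kind=SPECULATIVE, payoff=0.9)
sched.spawn(lambda: "must-do-1")
sched.spawn(lambda: "must-do-2")
order = sched.run()
```

A fixed number per task is the simplest possible policy. It cannot express the case where a speculative task becomes mandatory, or worthless, as the computation proceeds, which is part of what makes the general question open.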

Data abstractions and mutable objects. 

An important characteristic of a language 
for parallel computing is whether or not it 
allows the writing of nondeterminate pro¬ 
grams, programs that may produce dif¬ 
ferent results for different (legal) orders of 
execution. Multilisp and QLisp, by in¬ 
cluding side effects, allow nondeter¬ 
minism, while functional languages forbid 
side effects and assure determinate execu¬ 
tion. Issues relating to determinacy are 
discussed elsewhere. 2,4 A capsule sum¬ 
mary of the debate is that potentially 
nondeterminate programs can be very 
hard to debug and verify, but that the 
language restrictions needed to assure 
determinacy are substantial and rule out 
many familiar and useful program struc¬ 
tures. Multilisp allows side effects and 
hence permits these program structures. 
To help control the resulting software 
engineering problems, Multilisp supports 
side-effect-free expression of many com¬ 
putations and, through the first-class 


citizenship of Multilisp procedures, sup¬ 
ports the construction of data abstractions 
within which side effects can be compart¬ 
mentalized. This reduces their contribu¬ 
tion to program complexity. 

Correct implementation of first-class 
citizenship for procedures, in combination 
with lexical scoping, requires the use of 
garbage-collected heap storage for pro¬ 
cedural environments. Once this expense 
is incurred, however, procedures can be 
used as data abstractions. The nonlocal 
variables of a procedure, found in the lex¬ 
ically enclosing environment, can be con¬ 
sidered the underlying state variables of a 
data abstraction implemented by the pro¬ 
cedure. Operations on this data abstrac¬ 
tion can be performed by calls to the pro¬ 
cedure. As with other implementations of 
abstract data types, the underlying state 
variables can be protected from access ex¬ 
cept through the channels provided by the 
abstraction. 2 

Although it does include side effects, 
Lisp is superior to most common pro¬ 
gramming languages in that it includes a 
side-effect-free subset with substantial ex¬ 
pressive power. This subset is part of 
Multilisp; thus it is possible to write signifi¬ 
cant bodies of Multilisp code in a com¬ 
pletely side-effect-free way. Furthermore, 
where side effects are used, as in maintain¬ 
ing a changing database, they can be en¬ 
capsulated within a data abstraction that 
synchronizes concurrent operations on the 
data. The data abstraction can ensure that 
the data are only accessed according to the 
proper protocol. 
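
The same idea can be sketched in Python, whose closures also keep their nonlocal variables in heap storage. In this hypothetical example (the account abstraction and its operation names are invented), the variable balance is reachable only through the returned procedure, and a lock inside the abstraction serializes concurrent operations.

```python
import threading

def make_account(balance):
    """Return a procedure that is the only way to reach `balance`."""
    lock = threading.Lock()          # synchronizes concurrent operations

    def account(op, amount=0):
        nonlocal balance
        with lock:
            if op == "deposit":
                balance += amount
            elif op == "withdraw":
                if amount > balance:
                    raise ValueError("insufficient funds")
                balance -= amount
            elif op != "balance":
                raise ValueError("unknown operation: " + op)
            return balance

    return account

acct = make_account(100)

def depositor():
    for _ in range(500):
        acct("deposit", 1)

threads = [threading.Thread(target=depositor) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_balance = acct("balance")
```

Callers never see the lock or the variable, so code outside the abstraction can remain free of side effects and of synchronization concerns.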

Multilisp thus supports a programming 
style in which most code is written without 
side effects and data abstractions are used 
to encapsulate data on which side effects 
may be performed, to present a reasonable 
interface to the exterior. A programmer’s 
aim in using this style should be to produce 
a program whose side effects are compart¬ 
mentalized carefully enough that any 
module may safely be invoked in parallel 
with any other. If this style is followed, the 
difficulties caused by the presence of side 
effects will be isolated to small regions of 
the program and should therefore be 
reduced to manageable proportions. 

Debugging and exception handling. 

Debugging and exception handling are 
closely related topics. The need for debug¬ 
ging is often revealed by the occurrence of 
some runtime exception not anticipated by 
the programmer. At times, however, a 
programmer may anticipate the possible 
occurrence of an exception such as an end- 
of-file on read or an attempt to divide by 
zero. In such cases, the ability to flexibly 
specify a handler for an exception can be 
an important program structuring tool. 

The occurrence of exceptions in a par¬ 
allel environment presents an interesting 
problem. If a task has been created to 
calculate the value of some future and is 
unable to complete due to the occurrence 
of an exception, what value should be 
given to the future? Before the occurrence 
of the exception, the procedure that 
created the future may have returned and 
the future itself may have been distributed 
to many other tasks, some of which may 
already have become suspended waiting 
for the future’s value. If the occurrence of 
an exception causes the future never to 
receive a value, then these tasks will never 
resume execution and nontermination of 
the program containing them is a likely 
result. On the other hand, what mean¬ 
ingful value do we give a future created by 
an expression such as (future (/ 3 0)), 
which has been asked to perform a divi¬ 
sion by zero? Multilisp’s solution to this 
problem involves error values. 5 If the 
evaluation of an expression X in (future X) 
cannot complete normally, the future can 
resolve to an error value. An exception will 
be raised in any task that touches an error 
value. In this way, the consequences of an 
exception that occurs while calculating a 
value propagate back through all users of 
the value. This propagation mirrors the 
popping of stack frames that occurs when 
unwinding a sequential computation after 
an exception. 
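
A rough Python analogue of error values is sketched below. It is an illustration of the idea only, not Multilisp's mechanism: the ErrorValue and Future classes and the touch function are assumed names, and each future runs in its own thread. A failed computation still resolves its future, so waiting tasks are released, and the exception resurfaces in every task that touches the result.

```python
import threading

class ErrorValue:
    """What a future resolves to when its computation raised an exception."""
    def __init__(self, exc):
        self.exc = exc

class Future:
    def __init__(self, thunk):
        self._done = threading.Event()
        self._value = None
        threading.Thread(target=self._run, args=(thunk,), daemon=True).start()

    def _run(self, thunk):
        try:
            self._value = thunk()
        except Exception as exc:
            self._value = ErrorValue(exc)   # resolve anyway, never hang
        self._done.set()

def touch(x):
    """Wait for x if it is a future; raise if it turned out to be an error."""
    while isinstance(x, Future):
        x._done.wait()
        x = x._value
    if isinstance(x, ErrorValue):
        raise x.exc
    return x

ok = Future(lambda: 6 / 3)
bad = Future(lambda: 3 / 0)                 # cf. (future (/ 3 0))
user = Future(lambda: touch(bad) + 1)       # a task that uses the bad value
```

Touching user raises the same division error as touching bad, which is the propagation back through all users of the value described above.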

Program debugging also takes on new 
dimensions in a parallel environment. 
Broadly speaking, debugging is concerned 
with two properties of programs: correct¬ 
ness and performance. Correctness rightly 
takes first place. A fast program is of little 
use if it produces incorrect results. Never¬ 
theless, the reason for using parallel pro¬ 
cessing in the first place is to improve 
performance, so it is important for a pro¬ 
grammer to be able to find and remove any 
obstacles to maximum performance (such 
as unnecessary data dependencies). To do 
this, programmers need good tools for 
visualizing the operation of concurrent 
programs. The factors affecting the per¬ 
formance of a program in a parallel en¬ 


vironment are more complex and subtle 
than in the sequential case; thus debugging 
for performance should be taken less for 
granted in parallel programming. 


Experience with Multilisp 

A complete Multilisp implementation 
including a parallel, incremental garbage 
collector 4 exists on Concert, 13 an ex¬ 
perimental multiprocessor under con¬ 
struction in the author’s laboratory. The 
Concert multiprocessor, when fully built, 
will comprise 32 MC68000 processors 
and a total of about 20M bytes of memo¬ 
ry. It can be described most concisely as a 
shared-memory multiprocessor, although 
its organization includes various local 
paths from processors to nearby memory 
modules. Concert thus provides a higher 
overall bandwidth between processors and 
memory if most memory accesses are 
local. 

As of this writing, the largest part of 
Concert on which it has been possible to 
test Multilisp is a 24-processor section of 
the eventual Concert machine. Multilisp 
has also been implemented on a 128-processor 
Butterfly machine, 14 but a suitably 
tuned version of Butterfly Multilisp has 
not been available in time to include mean¬ 
ingful Butterfly measurements here. 

The Concert Multilisp implementation 
uses a layer of interpretation, so it is not 
fast in absolute terms. Understanding 
Concert Multilisp performance measure¬ 
ments is also complicated by several other 
factors, such as garbage collector perfor¬ 
mance, discussed elsewhere. 4 Despite 
these complications, the measured perfor¬ 
mance of Multilisp programs on Concert 
does offer some indication of the concur¬ 
rency made available by the use of futures. 
To gauge the impact of using futures in the 
program of Figure 2, the program was 
tested on Concert by successively inserting 
long lists of random numbers into an ini¬ 
tially empty tree. The test was essentially 
to evaluate the expression 

(insert v_n 
        (future (insert v_(n-1) ... 
                (future (insert v_1 empty-tree))...))) 

and then walk the resulting tree to wait for 
all futures to be resolved. v_1, v_2, ..., v_n 


represent the n numbers inserted into the 
tree. The running times on Concert, using 
varying numbers of processors, are plotted 
in Figure 4a for lists of length 128, 256, 
and 512. In each case, the performance of 
the parallel program with futures is plotted 
along with that of a sequential program 
just like the parallel program except that 
all instances of the future operator have 
been removed. For comparison, Figure 4b 
shows the performance on Concert of a 
Quicksort routine using futures, 4 along 
with figures for the corresponding sequen¬ 
tial Quicksort. 

Unfortunately, futures are quite expen¬ 
sive in the current Multilisp implementa¬ 
tion, causing insert with futures to take 
about twice as long as insert without 
futures. The future operation has been measured as taking 
about four times as long as a procedure 
call. Some ideas that promise to make 
future considerably cheaper are currently 
under investigation at M.I.T. 

Even though each data point in Figure 4 
is the average of many trials, the curves are 
somewhat uneven and have several peculiar 
features. Some of the variations are 
due to the garbage collector and also to the 
fact that some sequences of random 
numbers yield more nearly balanced trees 
than others. Other features are caused by 
lack of parallelism in the programs being 
measured, while yet others may be due to 
bus contention or other effects rooted in 
the implementation. Unfortunately, the 
data are quite recent and it is not yet clear 
what causes all of the features in these 
graphs. Furthermore, space limitations 
preclude a full explanation here of even 
those features of the graphs whose causes 
are understood. 

It is clearly too early to judge the ulti¬ 
mate success of the programming lan¬ 
guage ideas embodied in Multilisp. Never¬ 
theless, it is clear both that a substantial 
amount of concurrency can be exploited 
and that the speedup due to concurrency, 
for these examples, is limited. Although 
Figure 4 leaves many questions unan¬ 
swered, at least it shows that, for small 
numbers of processors, the promise of 
futures can be fulfilled. The challenge 
ahead is to remove the bottlenecks ap¬ 
parent in Figure 4, show good speed-ups 
on larger numbers of processors, and in¬ 
crease the Multilisp performance of the in¬ 
dividual processors to more competitive 
levels. 

Figure 4. Performance measurements for 
Concert Multilisp. The performance of 
the tree insertion procedure is shown in 
(a) and that of the Quicksort procedure in 
(b). The dashed lines are the curves of 
linear speedup. 


Directions for research 
in parallel symbolic 
computing 

The critical research questions in paral¬ 
lel symbolic computing may be grouped 
into three areas: 

(1) programming languages and pro¬ 
gramming environments, 

(2) algorithm and application develop¬ 
ment, and 

(3) implementation and architecture. 
The first area encompasses most of the 
questions addressed in this article. A pro¬ 
gramming language for parallel symbolic 
computing must support both the man¬ 
datory work and speculative flavors of 
parallelism. Its associated programming 
environment must include debugging tools 
to help in finding both correctness and 
performance bugs. Furthermore, the lan¬ 
guage must support the construction of 
reasonably modular programs and the 
constructs for obtaining concurrency must 
fit neatly within the modular structure of 
programs. A final major decision point in 
language design concerns the degree of 
nondeterminacy that can exist in program 
behavior. 



The development of programming lan¬ 
guages and environments should always 
be guided by the needs of application pro¬ 
grams. Since there are not many parallel 
symbolic application programs in exis¬ 
tence, research in programming languages 
for parallel machines must be comple¬ 
mented by the development of parallel ap¬ 
plication programs. Some of these pro¬ 
grams should be of substantial size. 
Although “toy” programs such as the in¬ 
sert example of the program in Figure 2 
promote insight into parallel language 
constructs, the ultimate application of 
parallel computers will be to much more 
complex programs, and it is well known 
that the engineering of large programs is 
qualitatively different from that of small 
programs. Language research and ap¬ 
plication development can reinforce each 
other. Language ideas can suggest new ap¬ 


plication programming strategies, and the 
requirements of application programs can 
suggest areas where language design deci¬ 
sions should be re-examined. 

Both language design and application 
program requirements should in¬ 
fluence the architecture of systems 
for parallel computing. Language con¬ 
structs may require clever implementation 
algorithms and/or special hardware sup¬ 
port for efficient execution. Application 
programs provide the invaluable service of 
helping focus the design of implementa¬ 
tions and architectures by indicating how 
much effect the efficient implementation 
of each language feature has on bottom- 
line performance. Thus, application pro¬ 
grams are not just useful for calibrating 
language design decisions, they also serve 
an important role as benchmarks to help 


evaluate proposed architectures. 

Of course, all these aspects of design 
interact. The art of the desirable in lan¬ 
guage design must be balanced against the 
art of the possible. Also, as we learn more 
about parallel computing, old decisions 
will be seen in new lights and sometimes 
modified. Nevertheless, we must make 
substantial progress on language defini¬ 
tion and application development before 
we can have any very solid objective 
grounds for evaluating proposed architec¬ 
tures for parallel symbolic computing. 
Developing the grounds for such evalua¬ 
tion is the principal research goal of the 
Multilisp project. 


Concurrent Prolog is a logic programming 
language designed for 
concurrent programming and par¬ 
allel execution. A process-oriented 
language, it embodies dataflow synchroni¬ 
zation and guarded-command indetermi¬ 
nacy as its basic control mechanisms. 

This article outlines the basic concepts 
and definition of the language, and sur¬ 
veys the major programming techniques 
that emerged out of three years of its use. 
The history of the language's development, 
implementation, and applications is also 
reviewed. 


Orientation 

Logic programming is based on an 
abstract computation model derived by 
Kowalski 1 from Robinson’s resolution 
principle. 2 A logic program is a set of 
axioms defining relationships between ob¬ 
jects. A computation of a logic program is 
a proof of a goal statement from the ax¬ 
ioms. Because the proof is constructive, it 
provides values for goal variables, which 
constitute the output of the computation. 

Figure 1 shows the relationships be¬ 
tween the abstract computation model of 
logic programming and two concrete pro¬ 
gramming languages based on it: Prolog 
and Concurrent Prolog. It shows that Pro¬ 
log programs are logic programs aug¬ 
mented with a control mechanism based 
on sequential search with backtracking; 
Concurrent Prolog’s control is based on 
guarded-command indeterminacy and 
dataflow synchronization. The execution 
model of Prolog is implemented using a 
stack of goals, which behave like pro¬ 
cedural calls. Concurrent Prolog’s com¬ 
putation model is implemented using a 
queue of goals, which behave like 
processes. 

Figure 2 argues for a homomorphism 
between von Neumann and logic, sequen¬ 
tial and concurrent languages. That is, it 
claims that the relationship between Oc¬ 
cam 3 and Concurrent Prolog is similar to 
the relationship between Pascal and Pro¬ 
log, and that the relationship between Pas¬ 
cal and Occam is similar to the relationship 
between Prolog and Concurrent Prolog. 
Some of the attributes in the figure are 
schematic and shouldn’t be taken literally. 
For example, Pascal has recursion, but its 
basic repetitive construct, as in Occam, is 
iteration, whereas in Prolog and Concur¬ 
rent Prolog it is recursion. Similarly, Oc¬ 
cam has if-then-else, but its basic condi¬ 
tional statement, as in Concurrent Prolog, 
is the guarded command. 


Logic programs 

A logic program is a set of axioms, or 
rules, defining relationships between ob¬ 
jects. A computation of a logic program is a 
deduction of consequences of the axioms. 


COMPUTER 






Abstract model:   Logic programs 
                  Nondeterministic goal reduction 
                  Unification 

Language:         Prolog                        Concurrent Prolog 
                  Goal and clause order         Commit and read-only operators 
                  define sequential search      define guarded-command indeterminacy 
                  and backtracking              and dataflow synchronization 

                  Stack of goals                Queue of goals 
                  + trail for backtracking      + suspension mechanism 

Figure 1. Logic programs, Prolog, and Concurrent Prolog. 


                        Sequential              Concurrent 
                        stack-based             queue-based 
                        procedure call          process activation 
                        parameter passing       message passing 
                        if-then-else/cut        guarded-command/commit 

von Neumann model       Pascal                  Occam 
  storage variables (mutable) 
  parameter passing, assignment, selectors, constructors 
  explicit/static allocation of data/processes 

logic programs model    Prolog                  Concurrent Prolog 
  logical variables (single assignment) 
  unification 
  implicit/dynamic allocation of data/processes with garbage collection 
  recursion 

Figure 2. A homomorphism between von Neumann and logic, sequential and concurrent languages. 


intersect(X,L1,L2) ← member(X,L1), member(X,L2). 

member(X,list(X,Xs)). 

member(X,list(Y,Ys)) ← member(X,Ys). 


Figure 3. A logic program for list intersection. 


The concepts of logic programming and 
the definition and implementation of the 
programming language Prolog date back 
to the early seventies. Earlier attempts 
were made to use Robinson’s resolution 
principle and unification algorithm 2 as 
the engine of a logic-based computation 
model. These attempts were frustrated by 
the inherent inefficiency of general resolu¬ 
tion and by the lack of a natural control 
mechanism that could be applied to it. 
Kowalski 1 found that such a control 
mechanism can be applied to a restricted 
class of logical theories, namely Horn 
Clause theories. 4 His major insight was 
that universally quantified axioms of the 
form 

    A ← B1, B2, ..., Bn    n ≥ 0 

can be read both declaratively, saying that 
A is true if B1 and B2 and ... Bn are true, 
and procedurally, saying that to prove the 
goal A (execute procedure A, solve problem 
A), one can prove subgoals (execute 
subprocedures, solve subproblems) B1 
and B2 and ... Bn. Such axioms are called 
definite clauses. A logic program is a finite 
set of definite clauses. 

Figure 3 is an example of a logic pro¬ 
gram for defining list intersection. It 
assumes that lists such as [1,2,3] are 
represented by recursive terms such as 
list(1,list(2,list(3,nil))). Declaratively, its 
first axiom reads: X is in the intersection of 
lists L1 and L2 if X is a member of L1 and 
X is a member of L2. Procedurally, it 
reads: to find an X in the intersection of L1 
and L2, find an X that is a member of L1 
and is also a member of L2. 

The axioms defining member read declaratively: 
X is a member of the list whose 
first element is X. X is a member of the list 
list(Y,Ys) if X is a member of Ys. (Here 
and in the following I use the convention 
that names of logical variables begin with 
an uppercase letter.) 
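
The procedural reading of Figure 3 can be sketched in an ordinary language. The following Python fragment is only an illustration, not part of the article's formalism: it represents list(1,list(2,nil)) as nested tuples, and the helper names from_python, member, and intersect are assumptions made for this example. Each generator enumerates the solutions that the corresponding clauses would yield.

```python
def from_python(items):
    """Build the recursive term list(a, list(b, ... nil)) from a Python list."""
    term = "nil"
    for item in reversed(items):
        term = ("list", item, term)
    return term


def member(term):
    """Enumerate every X such that member(X, term) holds."""
    while term != "nil":
        _, head, tail = term
        yield head          # member(X, list(X, Xs)).
        term = tail         # member(X, list(Y, Ys)) <- member(X, Ys).


def intersect(l1, l2):
    """Enumerate every X that is a member of l1 and also a member of l2."""
    for x in member(l1):
        for y in member(l2):
            if x == y:
                yield x


if __name__ == "__main__":
    print(list(intersect(from_python([1, 2, 3]), from_python([2, 3, 4]))))
```

The sketch fixes one search order (left to right through both lists); the logic program itself leaves that order unspecified.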

The differences between the various
logic programming languages, such as
sequential Prolog, Parlog [ClGr86], Guarded
Horn Clauses [Ued85], and Concurrent
Prolog [Sha83], lie in the way they deduce
consequences from such axioms. However,
the deduction mechanism used by all
these languages is based on the abstract
interpreter for logic programs, shown in
Figure 4.

On the face of it, the abstract interpreter 
seems nothing but a simple, nondetermin¬ 
istic reduction engine: it has a resolvent, 


which is a set of goals to reduce; it selects a 
goal from the resolvent, a unifiable clause 
from the program, and reduces the goal 
using the clause. What distinguishes this 
computation model from others is the 


logical variable and the unification pro¬ 
cedure associated with it. 

The basic computation step of the inter¬ 
preter, as well as that of Prolog and Con¬ 
current Prolog, is the unification of a goal 


August 1986 













Input:

A logic program P and a goal G

Output:

Gθ, which is an instance of G proved from P, or failure.

Algorithm:

Initialize the resolvent to be G, the input goal.

While the resolvent is not empty do

    choose a goal A in the resolvent
    and a fresh copy of a clause
    A' ← B1, B2, ..., Bk, k ≥ 0, in P,
    such that A and A' are unifiable with a substitution θ
    (exit if such a goal and clause do not exist).

    Remove A from, and add B1, B2, ..., Bk to, the resolvent.

    Apply θ to the resolvent and to G.

If the resolvent is empty then output G, else output failure.

Figure 4. An abstract interpreter for logic programs. 

1) Goal=process

2) Conjunctive goal=network of processes 

3) Shared logical variable=communication channel =shared-memory single-assignment variable 

4) Clauses of a logic program=rules, or instructions, for process behavior. 


Figure 5. Concepts of logic programming and concurrency. 


with the head of a clause. 2 

The unification of two terms involves
finding a substitution of values for
variables in the terms that makes the two
terms identical. Thus unification is a simple
and powerful form of pattern matching.

Unification is the basic, and only, data 
manipulation primitive in logic program¬ 
ming. Understanding logic programming 
is understanding the power of unification. 
Unification subsumes the following data- 
manipulation primitives, used in conven¬ 
tional programming languages: 

• Single-assignment (assigning a value 
to a single-assignment variable), 

• Parameter passing (binding actual 
parameters to formal parameters in a 
procedure or function call), 

• Simple testing (testing whether a 
variable equals some value, or if the 
values of two variables are the same), 

• Data access (field selectors in Pascal, 
car and cdr in Lisp), 

• Data construction (new in Pascal,
cons in Lisp), and

• Communication. 

The efficient implementation of a logic
programming language involves compiling
the known part of unification,
as specified by the program’s clause heads,
into the above-mentioned set of more
primitive operations. 5

A term is either a variable, such as X, a
constant, such as a or 13, or a compound
term f(T1, T2, ..., Tn) whose main functor
has name f, arity n, and whose
arguments T1, T2, ..., Tn are terms.

A substitution element is a pair of the
form Variable = Term. An (idempotent)
substitution is a finite set of substitution
elements {V1 = T1, V2 = T2, ..., Vn = Tn}
such that Vi ≠ Vj if i ≠ j, and Vi does not
occur in Tj for any i and j.

The application of a substitution θ to a
term S, denoted Sθ, is the term obtained by
replacing every occurrence of a variable V
by the term T for every substitution
element V = T in θ. Such a term is called an
instance of S.

For example, applying the substitution
{X=3, Xs=list(1,list(3,nil))} to the term
member(X,list(X,Xs)) yields the term
member(3,list(3,list(1,list(3,nil)))).

A substitution θ unifies terms T1 and
T2 if T1θ = T2θ. Two terms are unifiable
if they have a unifying substitution. If two
terms T1 and T2 are unifiable, then there
exists a unique substitution θ (up to
renaming of variables), called the most general
unifier of T1 and T2, with the following
property: For any other unifying
substitution σ of T1 and T2, T1σ is an instance of
T1θ. Hereafter, I use “unifier” as
shorthand for “most general unifier.”

For example, the unifier of X and a is
{X=a}. The unifier of X and Y is {X=Y}
(or {Y=X}). The unifier of f(X,X) and
f(A,b) is {X=b, A=b}, and the unifier of
g(X,X) and g(a,b) does not exist.
Considering the example logic program above,
the unifier of member(A,list(1,list(2,nil)))
and member(X,list(X,Xs)) is {A=1,
X=1, Xs=list(2,nil)}.
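
The examples above can be checked with a small unifier. The Python sketch below is an illustration under stated assumptions, not the algorithm of any particular Prolog system: compound terms are tuples whose first element is the functor name, constants are strings or integers, and the names Var, walk, occurs, unify, and resolve are hypothetical helpers introduced here.

```python
class Var:
    """A logical variable; identity (not name) distinguishes variables."""
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


def walk(t, s):
    """Follow variable bindings in substitution s until an unbound term."""
    while isinstance(t, Var) and t in s:
        t = s[t]
    return t


def occurs(v, t, s):
    """True if variable v occurs in term t under substitution s."""
    t = walk(t, s)
    return t is v or (isinstance(t, tuple) and any(occurs(v, a, s) for a in t[1:]))


def unify(t1, t2):
    """Return a most general unifier as a dict, or None if none exists."""
    s, stack = {}, [(t1, t2)]
    while stack:
        a, b = (walk(t, s) for t in stack.pop())
        if a is b:
            continue
        if isinstance(b, Var) and not isinstance(a, Var):
            a, b = b, a                      # keep the variable on the left
        if isinstance(a, Var):
            if occurs(a, b, s):
                return None                  # would create an infinite term
            s[a] = b
        elif isinstance(a, tuple) and isinstance(b, tuple):
            if a[0] != b[0] or len(a) != len(b):
                return None                  # different functor or arity
            stack.extend(zip(a[1:], b[1:]))
        elif a != b:
            return None                      # two different constants
    return s


def resolve(t, s):
    """Apply substitution s to term t, yielding an instance of t."""
    t = walk(t, s)
    if isinstance(t, tuple):
        return (t[0],) + tuple(resolve(a, s) for a in t[1:])
    return t


if __name__ == "__main__":
    A, X, Xs = Var("A"), Var("X"), Var("Xs")
    goal = ("member", A, ("list", 1, ("list", 2, "nil")))
    head = ("member", X, ("list", X, Xs))
    s = unify(goal, head)
    print({v: resolve(v, s) for v in (A, X, Xs)})
```

The dictionary returned by unify is in triangular form (a binding may mention another bound variable); resolve produces the fully instantiated term, which corresponds to the idempotent substitution defined above.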


Concurrent Prolog 

Concurrent programming: processes, 
communication, and synchronization. A 

concurrent programming language can ex¬ 
press concurrent activities, or processes, 
and communication among them. Pro¬ 
cesses are abstract entities, the generaliza¬ 
tion of the execution thread of sequential 
programs. The actions a process can take 
include interprocess communication, 
change of state, creation of new processes, 
and termination. 

It might seem that a declarative lan¬ 
guage, based on the logic programming 
computational model, would be unsuit¬ 
able for expressing the wide spectrum of 
actions of concurrent programs. This is 
not the case. Sequential Prolog shows 
that, in addition to its declarative reading, 
a logic program can be read procedurally. 
Concurrent Prolog shows yet another pos¬ 
sible reading of logic programs, namely 
the process behavior reading, or process 
reading for short. The essential com¬ 
ponents of concurrent computations (con¬ 
current actions, indeterminate actions, 
communication, and process creation and 
termination) are already embodied in the 
abstract computation model of logic pro¬ 
gramming; they can be uncovered using 
the process reading. 

Before introducing the computational 
model of Concurrent Prolog that em¬ 
bodies these notions, I would like to dwell 
on the intuitions and metaphors that link 
the formal, symbolic, computational 
model with the familiar concepts of con¬ 
current programming, via a sequence of 
analogies shown in Figure 5. I exemplify 
them using the Concurrent Prolog program
for quicksort, shown in Figure 6. For
now the read-only operator (?)
can be ignored, and the commit operator
(|) can be read as a conjunction (,).

Following Edinburgh Prolog, the term
[X|Xs] is a syntactic convention replacing
list(X,Xs), and [ ] replaces nil. The list
[1,2|Xs] is shorthand for [1|[2|Xs]],
that is list(1,list(2,Xs)), and [1,2,3] for
list(1,list(2,list(3,nil))). The clauses for
quicksort read: Sorting the list [X|Xs]
gives Ys if partitioning Xs with respect to
X gives Smaller and Larger, sorting Larger
gives Ls, sorting Smaller gives Ss, and
appending [X|Ls] to Ss gives Ys. Sorting the
empty list gives the empty list.

The first clause of partition reads:
Partitioning a list [Y|In] with respect to X gives
[Y|Smaller] and Larger if X ≥ Y and
partitioning In with respect to X gives Smaller
and Larger.

1) Goal = process 

A goal p(T1, T2, ..., Tn) can be viewed as
a process. The arguments of the goal (T1,
T2, ..., Tn) constitute the data state of the
process. The predicate, p/n (name p, arity
n), is the program state, which determines
the procedure (set of clauses with same
predicate name and arity) executed by the
process. A typical state of a quicksort
process might be qsort([5,38,4,7,19|Xs],Ys).

2) Conjunctive goal = network of processes

A network of processes is defined by its 
constituent processes and by the way they 
are interconnected. A conjunctive goal is a 
set of processes. For example, the body of 
the recursive clause of quicksort defines a 
network of four processes, one partition 
process, two quicksort processes, and one 
append process. The variables shared be¬ 
tween the goals in the conjunction deter¬ 
mine an interconnection scheme. This 
leads to a third analogy. 

3) Shared logical variable = communication
channel = shared-memory single-assignment
variable

A communication channel provides a 
means by which two or more processes 
may communicate information. A shared 
variable is another means for several pro¬ 
cesses to share or communicate informa¬ 
tion. A logical variable, shared between 
two or more goals (processes), can serve 
both these functions. For example, the 
variables Smaller and Larger serve as com¬ 
munication channels between partition 


and the two recursive quicksort processes. 

Logical variables are single-assignment, 
since a logical variable can be assigned 
only once during a computation. Hence, a 
logical variable is analogous to a com¬ 
munication channel capable of transmit¬ 
ting only one message, or to a shared- 
memory variable that can receive only one 
value. 

Note that under this single-assignment 
restriction the distinction between a com¬ 
munication channel and a shared-memory 
variable vanishes. It is convenient to view 
shared logical variables sometimes as 
analogous to communication channels 
and sometimes as analogous to shared- 
memory variables. 

The single-assignment restriction has 
been proposed as suitable for parallel pro¬ 
gramming languages independently of 
logic-programming. At first sight, it 
would seem a hindrance to the expressive¬ 
ness of Concurrent Prolog, but it is not. 
Multiple communications and cooperative 
construction of a complex data structure 
are possible by starting with a single, 
shared logical variable, as explained 
below. 

4) Clauses of a logic program = rules, or
instructions, for process behavior

The actions of a process can be separated 
into control actions and data actions. Con¬ 
trol actions include termination, iteration, 
branching, and creation of new processes. 
These are specified explicitly by logic pro¬ 
gram clauses. Data actions include commu¬ 
nication and various operations on data 
structures, such as single-assignment, in¬ 
spection, testing, and construction. As in 
sequential Prolog, data actions are speci¬ 
fied implicitly by the arguments of the head 
and body goals of a clause, and are realized 
through unification. 

Process reading of logic programs. Ter¬ 
mination, iteration, branching, state- 
change, and creation of new processes can 
be specified by clauses, using the process 
reading of logic programs. 

Terminate. A unit clause, or a definite 
clause with an empty body, 

    p(T1, T2, ..., Tn).

specifies that a process in a state unifiable
with p(T1, T2, ..., Tn) can reduce itself to
the empty set of processes and thus
terminate. For example, the clause
quicksort([ ],[ ]) says that any process that

quicksort([X|Xs],Ys) ←
    partition(Xs?,X,Smaller,Larger),
    quicksort(Smaller?,Ss),
    quicksort(Larger?,Ls),
    append(Ss?,[X|Ls?],Ys).
quicksort([ ],[ ]).

partition([Y|In],X,[Y|Smaller],Larger) ←
    X ≥ Y | partition(In?,X,Smaller,Larger).
partition([Y|In],X,Smaller,[Y|Larger]) ←
    X < Y | partition(In?,X,Smaller,Larger).
partition([ ],X,[ ],[ ]).

append([X|Xs],Ys,[X|Zs]) ←
    append(Xs?,Ys,Zs).
append([ ],Xs,Xs).

Figure 6. A Concurrent Prolog Quicksort 
program. 

unifies with it, such as quicksort([ ],Ys),
may terminate. While doing so, this
process unifies Ys with [ ], effectively closing
its output stream.

Change of data and program state. An 
iterative clause, or a clause with one goal in 
the body, 

    p(T1, T2, ..., Tn) ← q(S1, S2, ..., Sm).

specifies that a process in a state unifiable
with p(T1, T2, ..., Tn) can change its state to
q(S1, S2, ..., Sm). The program state is
changed to q/m (branch) and the data state
to (S1, S2, ..., Sm). For example, the
recursive clause of append specifies that the
process append([1,3,4,7,12|L1],[21,22,25|L2],L3)
can change its state to
append([3,4,7,12|L1],[21,22,25|L2],Zs).
While doing so, it unifies L3 with [1|Zs],
effectively sending an element down its
output stream. Since append branches back to
itself, it is actually an iterative process.

Create new processes. A general clause, 
of the form 

    p(T1, T2, ..., Tn) ← Q1, Q2, ..., Qm.

specifies that a process in a state unifiable
with p(T1, T2, ..., Tn) can replace itself
with m new processes as specified by
Q1, Q2, ..., Qm. For example, the recursive
clause of quicksort says that a quicksort
process whose first argument is a list can
replace itself with a network of four
processes: one partition process, two
quicksort processes, and one append process. It
further specifies their interconnection and
initializes the first element in the list
forming the second argument of append to be
X, the partitioning element. Note that
under this reading an iterative clause can
be viewed as specifying that a process can
be replaced by another process, rather
than change its state. These two views are
equivalent.

Recall the abstract interpreter in Figure
4. Under the process reading, the resolvent
(the current set of goals of the interpreter) 
is viewed as a network of concurrent pro¬ 
cesses, where each goal is a process. The 
basic action a process can take is process 
reduction: the unification of the process 
with the head of a clause, and its reduction 
to (or replacement by) the processes spec¬ 
ified by the body of the clause. The actions 
a process can take depend on its state—on 
whether its arguments unify with the argu¬ 
ments of the head of a given clause. 

Concurrency can be achieved by reduc¬ 
ing several processes in parallel. This form 
of parallelism is called And-parallelism. 
Communication is achieved by the assign¬ 
ment of values to shared variables, caused 
by the unification that occurs during pro¬ 
cess reduction. Given a process to reduce, 
all clauses applicable for its reduction may 
be tried in parallel. This form of parallel¬ 
ism, called Or-parallelism, is the source of 
a process’s ability to take indeterminate 
actions. 

Synchronization using the read-only 
and commit operators. In contrast to se¬ 
quential Prolog, in Concurrent Prolog an 
action taken by a process cannot be un¬ 
done: once a process has reduced itself 
using some clause, it is committed to it. 
The resulting computational behavior is 
called committed-choice nondeterminism, 
don’t care nondeterminism, and some¬ 
times also indeterminacy, to distinguish it 
from the “don’t-know” nondeterminism 
of the abstract interpreter. 

This design decision is common to other 
concurrent logic programming languages, 
including the original Relational Lan¬ 
guage [ClGr81], Parlog [ClGr86], and 
GHC [Sil86]. It implies that a process 
faced with a choice had better make a cor¬ 
rect one, lest it doom the entire computa¬ 
tion to failure. 

The basic strategy taken by Concurrent 
Prolog to ensure that processes make cor¬ 
rect choices of actions is to provide the 
programmer with a mechanism to delay 
process reductions until enough informa¬ 
tion is available that a correct choice can 
be made. 


The two synchronization and control
constructs of Concurrent Prolog are the
read-only and the commit operators. The
read-only operator (indicated by a
question-mark suffix ‘?’) can be applied to
logical variables, such as X?, thus
designating them as read-only. The read-only
operator is ignored in the declarative
reading of a clause and can be understood only
operationally.

Intuitively, a read-only variable cannot
be written upon (be instantiated). It can
receive a value only through the
instantiation of its corresponding write-enabled
variable. A unification that attempts to
instantiate a read-only variable suspends
until that variable becomes instantiated.
For example, the unification of X? with a
suspends; that of f(X,Y?) with f(a,Z)
succeeds, with unifier {X=a, Z=Y?}. In
Figure 6, the unification of quicksort(In?,
Out) with both quicksort([ ],[ ]) and
quicksort([X|Xs],Ys) suspends, as does
the unification of append(L1?,[3,4,5
|L2],L3) with the heads of its two clauses.
However, as soon as In? gets instantiated
to [3|In1] (for example, by another
partition process with a write-enabled
occurrence of In), the unification of the
quicksort goal with the head of the clause for
the empty list fails, and with the head of
the recursive clause succeeds.

Definition: The following captures this 
insight more precisely. Assume two 
distinct sets of variables, write-enabled 
variables and read-only variables. The 
read-only operator, ?, is a one-to-one 
mapping from write-enabled to read-only 
variables. It is written in postfix notation. 
For every write-enabled variable X, the
variable X? is the read-only variable
corresponding to X.

The extension of the read-only operator 
to terms which are not write-enabled vari¬ 
ables is the identity function. 

Definition: A substitution θ affects a
variable X if it contains a substitution
element X=T. A substitution θ is admissible
if it does not affect any read-only variable.

Definition: The read-only extension of
a substitution θ, denoted θ?, is the result of
adding to θ the substitution elements
X?=T? for every X=T in θ such that
T ≠ X?.

Definition: The read-only unification of
two terms T1 and T2 succeeds, with
read-only mgu θ?, if T1 and T2 have an
admissible mgu θ. It suspends if T1 and T2
unify but none of their mgus is
admissible. It fails if T1 and
T2 do not unify.

Note that the definition of unifiability
prevents a unification attempt from
instantiating read-only variables. However, once
the unification succeeds, the read-only
unifier instantiates read-only variables in
accordance with their corresponding
write-enabled variables.
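
The three outcomes of read-only unification can be illustrated with a short Python sketch. It follows the definitions above literally: read-only variables form a separate set of variables, an mgu is computed while preferring to bind write-enabled variables, and the attempt suspends if a read-only variable had to be bound. The representation (tuples for compound terms, a read_only flag on Var) and all function names are assumptions of this example, and the read-only extension θ? is not modeled.

```python
class Var:
    """A logical variable; read_only=True models an occurrence such as X?."""
    def __init__(self, name, read_only=False):
        self.name, self.read_only = name, read_only

    def __repr__(self):
        return self.name + ("?" if self.read_only else "")


def walk(t, s):
    """Follow bindings in s until reaching an unbound variable or a term."""
    while isinstance(t, Var) and t in s:
        t = s[t]
    return t


def occurs(v, t, s):
    """True if variable v occurs in term t under substitution s."""
    t = walk(t, s)
    return t is v or (isinstance(t, tuple) and any(occurs(v, a, s) for a in t[1:]))


def read_only_unify(t1, t2):
    """Return ("succeed", mgu), ("suspend", None), or ("fail", None)."""
    s, stack = {}, [(t1, t2)]
    while stack:
        a, b = (walk(t, s) for t in stack.pop())
        if a is b:
            continue
        a_writable = isinstance(a, Var) and not a.read_only
        if not a_writable and isinstance(b, Var):
            a, b = b, a                      # prefer binding a write-enabled variable
        if isinstance(a, Var):
            if occurs(a, b, s):
                return "fail", None
            s[a] = b                         # may bind a read-only variable if forced
        elif isinstance(a, tuple) and isinstance(b, tuple):
            if a[0] != b[0] or len(a) != len(b):
                return "fail", None
            stack.extend(zip(a[1:], b[1:]))
        elif a != b:
            return "fail", None
    if any(v.read_only for v in s):          # the mgu affects a read-only variable
        return "suspend", None
    return "succeed", s


if __name__ == "__main__":
    X, Z = Var("X"), Var("Z")
    Xq, Yq = Var("X", read_only=True), Var("Y", read_only=True)
    print(read_only_unify(Xq, "a"))                        # suspends
    print(read_only_unify(("f", X, Yq), ("f", "a", Z)))    # succeeds
```

Failure takes precedence over suspension here because the loop keeps unifying after a forced read-only binding, so a later clash is still detected.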

The second synchronization and con¬ 
trol construct of Concurrent Prolog is the 
commit operator. A guarded clause is a 
clause of the form 

    A ← G1, G2, ..., Gm | B1, B2, ..., Bn        m, n ≥ 0

The commit operator (|) separates the
right hand side of a rule into a guard and a
body. Declaratively, the commit operator
is read just like a conjunction: A is true if
the G’s and the B’s are true. Procedurally,
the reduction of a process A1 using such a
clause suspends until A1 is unifiable with
A, and the guard is determined to be true.
Thus the guard is another mechanism for
preventing or postponing erroneous
process actions.

As a syntactic convention, if the guard is 
empty (m=0), the commit operator is 
omitted. 

The read-only variables in the recursive 
invocations of quicksort, partition, and 
append cause them to suspend until it is 
known whether the input is a list or nil. 
The non-empty guard in the recursive 
clauses for partition allows the process to 
choose correctly the output stream on 
which to place its next input element. It is 
placed on the first stream if it is smaller 
than or equal to the partitioning element. 
It is placed on the second stream if it is 
larger than the partitioning element. 

Concurrent Prolog allows G’s, the goals 
in the guard, to be calls to general Concur¬ 
rent Prolog programs. Hence guards can 
be nested recursively, and testing the 
applicability of a clause for reduction can 
be arbitrarily complex. In the following 
discussion I restrict my attention to a
subset of Concurrent Prolog called Flat
Concurrent Prolog [Mie85]. In Flat
Concurrent Prolog the goals in the guards can
contain calls to a fixed set of simple test
predicates only. For example, Figure 6 is a
Flat Concurrent Prolog program.

In Flat Concurrent Prolog, the reduction
of a goal using a guarded clause succeeds
if the goal unifies with the clause’s
head and its guard test predicates succeed.
Input: A Flat Concurrent Prolog program P and a goal G

Output: Gθ, if Gθ was an instance of G proved from P, or deadlock otherwise.

Algorithm:

Initialize the resolvent to be G, the input goal.

While the resolvent is not empty do
    choose a goal A in the resolvent
    and a fresh copy of a clause

    A' ← G1, G2, ..., Gm | B1, B2, ..., Bn in P

    such that A and A' have a read-only unifier θ
    and the tests (G1, G2, ..., Gm)θ succeed
    (exit if such a goal and clause do not exist).

    Remove A from and add B1, B2, ..., Bn to the resolvent.
    Apply θ to the resolvent and to G.

If the resolvent is empty then output G,
else output deadlock.


Figure 7. An abstract interpreter for Flat Concurrent Prolog. 


Flat Concurrent Prolog is both the target
language and the implementation language
for the Logix system, discussed
later. It is a rich enough subset of Concur¬ 
rent Prolog to be sufficient for most prac¬ 
tical purposes. It is simple enough to be 
amenable to an efficient implementation, 
resulting in a high-level concurrent pro¬ 
gramming language practical even on con¬ 
ventional uniprocessors. 

An abstract interpreter for Flat Concur¬ 
rent Prolog. Flat Concurrent Prolog is 
provided with a fixed set T of test 
predicates. Typical test predicates include 
string(X) (which suspends until X is a
nonvariable, then succeeds if it is a string,
else fails) and X<Y (which suspends until
X and Y are nonvariables, then succeeds if
they are integers such that X<Y, else
fails).

Definition: A flat guarded clause is a 
guarded clause of the form 

    A ← G1, G2, ..., Gm | B1, B2, ..., Bn        m, n ≥ 0

such that the predicate of Gi is in T, for
all i, 0 < i ≤ m.

A Flat Concurrent Prolog program is a 
finite set of flat guarded clauses. 

An abstract interpreter of Flat Concur¬ 
rent Prolog is defined in Figure 7. The in¬ 
terpreter again leaves the nondeterministic 
choices for a goal and a clause unspecified: 
the scheduling policy, by which goals are 
added to and removed from the resolvent, 
and the clause selection policy, which in¬ 
dicates which clause to choose for reduc¬ 
tion when several clauses are applicable. 

Fairness in the scheduling and clause
selection policies is discussed elsewhere
[Tay85]. For concreteness, we will explain
the choices made in Logix. Logix
implements bounded depth-first scheduling.
In bounded depth-first scheduling the
resolvent is maintained as a queue and
each dequeued goal is allocated a
time-slice t. A dequeued goal can be reduced t
times before it is returned to the
queue. If a goal is reduced using an
iterative clause A ← B, then B inherits the
remaining time-slice. If it is reduced using
a general clause A ← B1, B2, ..., Bn, then,
by convention, B1 inherits the remaining
time-slice and B2 to Bn are enqueued at the
back of the queue. Bounded depth-first
scheduling reduces the overhead of
process switching and allows more effective
caching of process arguments in registers.


Logix also implements stable clause selec¬ 
tion, which means that if a process has 
several applicable clauses for reduction, 
the first one (textually) will be chosen. 
Stability is a property that can be abused 
by programmers. It is hard to preserve in a 
distributed implementation [Tay85] and 
makes the life of optimizing compilers 
harder. It is not part of the language 
definition. 

In addition, Logix implements a non¬ 
busy waiting mechanism, in which a sus¬ 
pended process is associated with the set of 
read-only variables that causes the suspen¬ 
sion of its clause reductions. If any of the 
variables in that suspension set gets instan¬ 
tiated, the process is activated and en¬ 
queued to the back of the queue. 
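
Bounded depth-first scheduling as described above can be sketched as a queue with a reduction budget. The Python fragment below is a toy model, not the Logix implementation: goals are plain tuples, the reduce argument stands in for clause selection and unification, suspension is not modeled, and the names run and toy_reduce are invented for this illustration.

```python
from collections import deque


def run(initial_goals, reduce, time_slice):
    """Reduce goals until the queue is empty; return the order of reductions."""
    queue = deque(initial_goals)
    trace = []
    while queue:
        goal = queue.popleft()
        budget = time_slice                  # each dequeued goal gets a fresh slice
        while goal is not None and budget > 0:
            body = reduce(goal)              # body goals of the chosen clause
            trace.append(goal)
            budget -= 1
            if not body:
                goal = None                  # unit clause: the process terminates
            else:
                goal = body[0]               # first body goal inherits the slice
                queue.extend(body[1:])       # the rest go to the back of the queue
        if goal is not None:
            queue.append(goal)               # slice exhausted: requeue at the back
    return trace


def toy_reduce(goal):
    """Hypothetical program: fork spawns two counters that count down to zero."""
    name, n = goal
    if name == "fork":
        return [("a", n), ("b", n)]
    return [] if n == 0 else [(name, n - 1)]


if __name__ == "__main__":
    print(run([("fork", 1)], toy_reduce, 3))
    print(run([("fork", 1)], toy_reduce, 2))
```

With a slice of 3 the first counter runs to completion before the second starts; with a slice of 2 it is preempted and requeued behind the second counter, which shows how the slice bounds the depth-first behavior.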

The abstract interpreter models concur¬ 
rency by interleaving. The truly parallel 
implementation of the language requires 
that each process reduction be viewed as 
an atomic transaction, which reads from 
and writes to logical variables. A parallel
interpreter must ensure that its resulting 
behavior is serializable (can be ordered to 
correspond to some possible behavior of 
the abstract interpreter). Such an algo¬ 
rithm has been designed [Tay85] and im¬ 
plemented on Intel’s iPSC at the Weiz- 
mann Institute. 


Concurrent Prolog programming techniques

In the past three years of its use, Con¬ 
current Prolog has collected a wide range 


of programming techniques. Some are 
simply known concurrent programming 
techniques restated in the formalism of 
logic programming, such as divide-and- 
conquer, monitors, stream processing, 
and bounded buffers. Others are novel 
techniques that exploit the unique aspects 
of logic programs, notably the logical 
variable. Examples include difference 
streams, incomplete messages, and the 
short-circuit technique. Some techniques 
exploit properties of the read-only vari¬ 
able, such as blackboards, constraint sys¬ 
tems, and protected data structures. 

Perhaps the most important in the long 
run are the meta-programming tech¬ 
niques. Using enhanced meta-interpret¬ 
ers, one can implement a wide spectrum of 
programming environment and operating 
system functions (such as inspecting and 
affecting the state of the computation, and 
detecting distributed termination and 
deadlock) in a simple and uniform way 
[SafSh86], [Hir86]. 

In the following account of these tech¬ 
niques, breadth was preferred over depth. 

Divide-and-conquer: recursion and 
communication. Divide-and-conquer is a 
method for solving a problem by dividing 
it into subproblems, solving them, possi¬ 
bly in parallel, and combining the results. 
If the subproblems are small enough, they 
are solved directly; otherwise, they are 
solved by applying the divide-and-conquer 
method recursively. Parallel divide-and- 
conquer algorithms can be specified easily 
in both functional and logic languages. 
number(leaf(N),N,N1) ←
    plus(N?,1,N1).
number(tree(L,R),N,N2) ←
    number(L?,N?,N1),
    number(R?,N1?,N2).


Figure 8. Numbering the leaves of a tree:
recursion with general communication.


Divide-and-conquer becomes more
interesting when it involves cooperation,
hence direct communication, among the 
processes solving the subproblems. The 
program in Figure 8 solves a problem 
pointed out by Leslie Lamport. The prob¬ 
lem is to number the leaves of a tree in 
ascending order from left to right, by the 
following recursive algorithm: spawn leaf 
processes, one per leaf, in such a way that 
each process has an input channel from the 
leaf process to its left and an output chan¬ 
nel to the leaf process to its right. The left¬ 
most leaf process is initialized with a 
number. Each process receives a number 
from the left, numbers its leaf with it, 
increments it by one, and sends the result 
to the right. The problem is presented
to explore the difficulties of combining
recursion with communication, and is not
necessarily a useful parallel algorithm.

The program assumes that binary trees
are represented using the terms leaf(X)
and tree(L,R). For example,
tree(leaf(X1),tree(leaf(X2),leaf(X3))) is a tree with
three leaves.

The program works in parallel on the 
two subtrees of a tree until it reaches a leaf, 
where it spawns a plus process. A plus pro¬ 
cess suspends until its first two arguments 
are integers, then unifies the third with 
their sum. The plus processes, however, 
cannot operate in parallel. Rather, they 
are synchronized in such a way that they 
are activated one at a time, starting from 
the leftmost node. 

The program passes the communication 
channels to the leaf processes in a simple 
and uniform way, through unification. It 
numbers a leaf by unifying its value with 
the left channel, even before that channel 
has transmitted a value. 
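
The data flow of Figure 8 can be traced with a sequential Python sketch. It is an assumption-laden rendering, not a concurrent program: a leaf is a mutable two-element list so that its number can be filled in, a tree is a tuple ("tree", L, R), and the function name number simply mirrors the predicate. The argument n plays the role of the left channel and the return value that of the right channel.

```python
def number(tree, n):
    """Number the leaves of tree from left to right, starting at n.

    Returns the next unused number, i.e. the value sent on to the right.
    """
    if tree[0] == "leaf":
        tree[1] = n              # number the leaf with the value from the left
        return n + 1             # plus(N, 1, N1): pass the successor to the right
    _, left, right = tree
    n1 = number(left, n)         # N flows into the left subtree, N1 comes out
    return number(right, n1)     # N1 flows into the right subtree, N2 comes out


if __name__ == "__main__":
    t = ("tree", ["leaf", None], ("tree", ["leaf", None], ["leaf", None]))
    print(number(t, 1), t)
```

In the Concurrent Prolog version both recursive calls start at once and the read-only annotations enforce the left-to-right order; here the call order does the same job.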

Stream processing. Concurrent Prolog 
is a single-assignment programming lan¬ 
guage, in that a logical variable can be 


assigned to a nonvariable term only once
during a computation. Hence, it seems
that, as a communication channel, a
shared logical variable can transmit at
most one message between two processes.
This is not quite true. A variable can be
assigned to a term that contains a message
and another variable. This new variable is
shared by the processes that shared the
original variable. Hence it can serve as a
new communication channel, which can
be assigned to a term that contains an
additional message and an additional
variable, and so on ad infinitum.

This idea is the basis of stream com¬ 
munication in Concurrent Prolog. In 
stream communication, the communicat¬ 
ing processes, typically one sender and one 
receiver (also called the stream’s producer 
and consumer), share a variable, say Xs. 
The sender who wants to send a sequence
of messages m1, m2, m3, ... assigns Xs to
[m1|Xs1] in order to send m1, then
instantiates Xs1 to [m2|Xs2] to send m2, then
assigns Xs2 to [m3|Xs3], and so on. The
receiver inspects the read-only variable
Xs?, attempting to unify it with
[M1|Xs1]. When successful, it can
process the first message M1 and iterate with
Xs1?, waiting for the next message.

Exactly the same technique would work 
for one sender and multiple receivers, pro¬ 
vided that all receivers have read-only ac¬ 
cess to the original shared variable. A 
receiver that spawns a new process can in¬ 
clude it in the group of receivers by pro¬ 
viding it with a read-only reference to the 
current stream variable. 
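
Stream communication over single-assignment variables can be imitated with threads. The Python sketch below is only an analogy under stated assumptions: LogicVar is a hypothetical write-once cell whose read blocks until it is bound (standing in for a read-only occurrence), a stream cell is a pair (message, next variable), and None plays the role of [ ].

```python
import threading


class LogicVar:
    """A single-assignment variable; read() blocks until bind() is called."""
    def __init__(self):
        self._bound = threading.Event()
        self._lock = threading.Lock()
        self._value = None

    def bind(self, value):
        with self._lock:
            if self._bound.is_set():
                raise ValueError("variable is already bound")
            self._value = value
            self._bound.set()

    def read(self):
        self._bound.wait()
        return self._value


def producer(xs, messages):
    """Send each message by binding Xs to (m, Xs1), then continue with Xs1."""
    for m in messages:
        tail = LogicVar()
        xs.bind((m, tail))
        xs = tail
    xs.bind(None)                            # close the stream


def consumer(xs, received):
    """Wait for each stream cell in turn and record its message."""
    while True:
        cell = xs.read()
        if cell is None:
            return
        message, xs = cell
        received.append(message)


def demo():
    """One sender, two receivers sharing the same stream variable."""
    xs, out1, out2 = LogicVar(), [], []
    threads = [
        threading.Thread(target=consumer, args=(xs, out1)),
        threading.Thread(target=consumer, args=(xs, out2)),
        threading.Thread(target=producer, args=(xs, [1, 2, 3])),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out1, out2


if __name__ == "__main__":
    print(demo())
```

Because every cell is written once and never changed, both receivers see the whole sequence regardless of how the threads are interleaved.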

The program for quicksort in Figure 6
demonstrates stream processing. Each
partition process has one input stream and 
two output streams. On each iteration it 
consumes one element from its input 
stream and places it on one of its output 
streams. When it reaches the end of its in¬ 
put stream, it closes its two output streams 
and terminates. The append process from 
the same program is a simpler example of a 
stream processor. It copies its first input 
stream into its output stream, and when it 
reaches the end of the first input stream, it 
binds the second input stream to its output 
stream and terminates. 

Stream merging. Streams are the basic 
communication means between processes 
in Concurrent Prolog. It is sometimes 
necessary, or convenient, to allow several 


processes to communicate with one other 
process. This is achieved in Concurrent 
Prolog using a stream merger. 

A stream merger is not a function, since
its output (the merged stream) can be
any one of the possible interleavings of its
input streams. Hence, stream-based
functional programming languages
incorporate stream mergers as a language
primitive. In logic programming, however,
a stream merger can be defined
directly, as was shown by Clark and
Gregory [ClGr81]. Their definition,
adapted to Concurrent Prolog, is shown in
Figure 9.

As a logic program, Figure 9 defines
the relation containing all facts
merge(Xs,Ys,Zs), in which the list Zs is an
order-preserving interleaving of the
elements of the lists Xs and Ys. As a
process, merge(Xs?,Ys?,Zs) behaves as
follows: If neither Xs nor Ys is
instantiated, it suspends, since unification with
all three clauses suspends. If Xs is a list,
then it can reduce using the first clause,
which copies the list element to Zs, its
output stream, and iterates with the updated
streams. Similarly with Ys and the second
clause. If it has reached the end of its input
streams, it closes its output stream and
terminates, as specified by the third clause.
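
The declarative reading of merge, that Zs is an order-preserving interleaving of Xs and Ys, can be checked directly. The Python sketch below tests membership in that relation for finite lists; the function name is_merge is an assumption of this example, and it says nothing about how a merge process chooses among interleavings.

```python
from functools import lru_cache


def is_merge(xs, ys, zs):
    """True if zs is an order-preserving interleaving of xs and ys."""
    xs, ys, zs = tuple(xs), tuple(ys), tuple(zs)
    if len(xs) + len(ys) != len(zs):
        return False

    @lru_cache(maxsize=None)
    def ok(i, j):
        # i elements of xs and j elements of ys have been placed in zs so far.
        k = i + j
        if k == len(zs):
            return True
        take_x = i < len(xs) and xs[i] == zs[k] and ok(i + 1, j)
        take_y = j < len(ys) and ys[j] == zs[k] and ok(i, j + 1)
        return take_x or take_y

    return ok(0, 0)


if __name__ == "__main__":
    print(is_merge([1, 2], [3], [1, 3, 2]))
    print(is_merge([1, 2], [3], [2, 1, 3]))
```

Several different outputs satisfy the relation for the same inputs, which is exactly why a merger is a relation rather than a function.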

In case both Xs and Ys have elements 
ready, either the first or the second clause 
can be used for reduction. The abstract 
interpreter of Flat Concurrent Prolog, 
defined in Figure 4, does not dictate which 
one to use. This may lead to an unfor¬ 
tunate situation, in which one clause (say 
the first) is always chosen, and elements 
from the second stream never appear in 
the output stream. A stream merger that 
allows this is called unfair. There are 
several techniques to implement fair 
mergers in Concurrent Prolog [ShMie84], 
[ShSaf86], [Ued84].
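
The behavior described above can be sketched outside Concurrent Prolog. The following Python fragment is only an illustration on finite lists, not the stream merger of Figure 9: the names merge, alternating, and preserves_order, and the choose parameter that stands in for the interpreter's free choice between the first two clauses, are assumptions of this sketch.

```python
import random


def merge(xs, ys, choose=random.random):
    """Return one order-preserving interleaving of the finite lists xs and ys.

    choose() models the interpreter's free choice when both inputs have an
    element ready: a result below 0.5 takes from xs, otherwise from ys.
    """
    xs, ys, zs = list(xs), list(ys), []
    while xs or ys:
        # Clause 1 applies when xs has an element, clause 2 when ys has one.
        take_x = bool(xs) and (not ys or choose() < 0.5)
        zs.append(xs.pop(0) if take_x else ys.pop(0))
    return zs  # clause 3: both inputs are exhausted, so the output is closed


def alternating():
    """A chooser that alternates between the inputs (one simple fair policy)."""
    state = {"x_next": False}

    def choose():
        state["x_next"] = not state["x_next"]
        return 0.0 if state["x_next"] else 1.0

    return choose


def preserves_order(zs, xs):
    """True if the elements of xs occur in zs in their original order."""
    it = iter(zs)
    return all(x in it for x in xs)
```

A chooser that always returns 0.0 reproduces the unfair behavior: with unbounded inputs, elements of the second stream would never reach the output. The alternating chooser is one way to rule that out; the cited papers describe the techniques actually used in Concurrent Prolog.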

Recursive process networks. The recur¬ 
sive structure of Concurrent Prolog, 
together with the logical variable, makes it 
a convenient language for specifying 
recursive process networks. An example is 
the quicksort program above. Although 
hard to visualize, the program forms two 
tree-like networks: a tree of partition pro¬ 
cesses, which partitions the input list into 
smaller lists, and a tree of append pro¬ 
cesses, which concatenates these lists 
together. 


Process trees are useful for divide-and- 
conquer algorithms and for searching, 
among other things. Consider an applica¬ 
tion to stream merging. An n-ary stream 
merger can be obtained by composing 
n -1 binary stream mergers in a process 
tree. A program for creating a balanced 
tree of binary merge operators is shown in 
Figure 10. 

Figure 10 creates a merge tree layer by 
layer, using an auxiliary procedure 
merge_layer. The merge trees defined are
static: the number of streams to be merged 
should be defined in advance, and cannot 
be changed easily. Multiway dynamic 
merge trees can be implemented in Con¬ 
current Prolog, using the concept of 
2-3-trees [ShMie84]. Ueda and Chikayama
[Ued84] and Shapiro and Safra
[ShSaf86] improve this scheme further. 
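
The layer-by-layer construction can be sketched in Python. This is a hedged illustration rather than a translation of Figure 10: streams are modeled as ordinary iterables, merge2 is a hypothetical binary merger that alternates between its inputs, and the mutable counter exists only to show that n streams need n - 1 binary mergers. Like Figure 10, the sketch assumes at least one input stream.

```python
def merge2(xs, ys):
    """Binary merger on iterables: alternate while both still have elements."""
    live = [iter(xs), iter(ys)]
    while live:
        for it in list(live):
            try:
                yield next(it)
            except StopIteration:
                live.remove(it)


def merge_layer(bottom, count):
    """Pair up adjacent streams; an odd stream out is passed up unchanged."""
    layer = []
    for i in range(0, len(bottom) - 1, 2):
        layer.append(merge2(bottom[i], bottom[i + 1]))
        count[0] += 1  # one more binary merger spawned
    if len(bottom) % 2:
        layer.append(bottom[-1])
    return layer


def merge_tree(bottom):
    """Return (merged stream, number of binary mergers) for a non-empty list."""
    count = [0]
    while len(bottom) > 1:
        bottom = merge_layer(bottom, count)
    return bottom[0], count[0]
```

Because each layer halves the number of streams, an element passes through about log2(n) mergers on its way to the top, which is the logarithmic delay of a balanced tree.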

More complex process structures, in¬ 
cluding rectangular and hexagonal process 
arrays [Sha84], quad-trees [Ede85], and 
pyramids, can easily be constructed in 
Concurrent Prolog. These process struc¬ 
tures are found useful in programming 
systolic algorithms and spawning virtual 
parallel machines [Tay86].

Systolic programming: parallelism with 
locality and pipelining. Systolic algo¬ 
rithms were designed originally by Kung 
and his colleagues 6 for implementation on 
special purpose hardware. However, they 
are based on two rather general principles: 

1) Localize communication and 

2) Overlap and balance computation 
with communication. 

The advantages of implementing sys¬ 
tolic algorithms on general purpose paral¬ 
lel computers using a high-level language, 
compared to implementation in special 
purpose hardware, are obvious. The sys¬ 
tolic programming approach [Sha84] was 
conceived in an attempt to apply the 
systolic approach to general purpose 
parallel computers. 

The specification of systolic algorithms 
in Concurrent Prolog is rather straightfor¬ 
ward. However, to ensure that perfor¬ 
mance is preserved in the implementation, 
two aspects of the execution of the pro¬ 
gram need explicit attention. One is the 
mapping of processes to processors, which 
should preserve the locality of the algo¬ 
rithm by using the locality of the architec¬ 
ture. Another is the communication pat¬ 
tern employed by the processes. 


In the systolic programming approach 
[Sha84], the mapping is done using a 
special notation, Logo-like Turtle pro¬ 
grams. Each process, like a turtle in Logo, 
is associated with a position and a heading. 
A goal in the body of a clause may have a 
Turtle program associated with it. When 
activated, this Turtle program, applied to 
the position and heading of the parent 
process, determines the position and heading
of the new process. Using this notation,
complex process structures can be
mapped in the desired way. Programming 
in Concurrent Prolog augmented with 
Turtle programs as a mapping notation is 
as easy as mastering a herd of turtles. 

Pipelining is the other aspect that re¬ 
quires explicit attention. The performance 
of many systolic algorithms depends on 
routing communication in specific pat¬ 
terns. The abstract specification of a 
systolic algorithm in Concurrent Prolog 
often does not enforce a communication 
pattern. However, the tools to do that are 
in the language. By appropriate transfor¬ 
mations, broadcasting can be replaced by 
pipelining, and specific communication 
patterns can be enforced [Tayl86]. For example,
the program in Figure 11 is a
Turtle-annotated Concurrent Prolog pro¬ 
gram for multiplying two matrices, based 
on the classic systolic algorithm that 
pipelines two matrices orthogonally on the 
rows and columns of a processor array. It 
assumes that the two input matrices are 
represented by a stream of streams of their 
columns and rows respectively. It pro¬ 
duces a stream of streams of the rows of 
the output matrix. The program operates 
by spawning a rectangular grid of ip pro¬ 
cesses for computing the inner products of 
each row and column. Unlike the original 
systolic algorithm, this program does not 
pipeline the streams between ip processes, 
but rather broadcasts them. However, 
pipelining can easily be achieved by add¬ 
ing two additional streams to each process 
[Sha84]. 
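
The clause structure of the matrix multiplication program can be mirrored in a few lines of Python. The sketch below is sequential and ignores the Turtle annotations, so it shows only which inner products are computed, not the process mapping or the communication pattern. The function names follow Figure 11. As an assumption of this sketch, the first argument is a list of rows and the second a list of columns, so that the result is the ordinary matrix product.

```python
def ip(xs, ys):
    """Inner product: Z := (X*Y) + Z1, with 0 for two empty vectors."""
    if not xs and not ys:
        return 0
    return xs[0] * ys[0] + ip(xs[1:], ys[1:])


def vm(xs, yss):
    """One line of ip processes: pair the vector xs with every vector in yss."""
    return [ip(xs, ys) for ys in yss]


def mm(xss, yss):
    """A rectangular grid of ip processes, one per (xs, ys) pair."""
    return [vm(xs, yss) for xs in xss]
```

For example, with rows [[1, 2], [3, 4]] and columns [[5, 7], [6, 8]], the grid contains four ip computations and yields [[19, 22], [43, 50]].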

The logical variable. All the program¬ 
ming techniques shown before can be real¬ 
ized in other computation models, with 
various degrees of success. For example, 
stream processing can be specified with 
functional notation. 7 By adding a non- 
deterministic constructor to functional 
languages, they can even specify stream 
mergers. 8 Using simultaneous recursion 


merge([X|Xs],Ys,[X|Zs]) ← merge(Xs?,Ys?,Zs).
merge(Xs,[Y|Ys],[Y|Zs]) ← merge(Xs?,Ys?,Zs).
merge([ ],[ ],[ ]).


Figure 9. A binary stream merger. 


merge_tree(Bottom,Top) ←
    Bottom ≠ [_] |
    merge_layer(Bottom,Bottom1),
    merge_tree(Bottom1?,Top).
merge_tree([Xs],Xs).

merge_layer([Xs,Ys|Bottom],[Zs|Bottom1?]) ←
    merge(Xs?,Ys?,Zs),
    merge_layer(Bottom?,Bottom1).
merge_layer([Xs],[Xs]).
merge_layer([ ],[ ]).
merge(Xs,Ys,Zs) ← (See Figure 15)


Figure 10. A balanced binary merge tree. 


mm([ ],_,[ ]).
mm([X|Xs],Ys,[Z|Zs]) ←
    vm(X,Ys?,Z)@right,
    mm(Xs?,Ys,Zs)@forward.
vm(_,[ ],[ ]).
vm(Xs,[Y|Ys],[Z|Zs]) ←
    ip(Xs?,Y?,Z), vm(Xs,Ys?,Zs)@forward.
ip([X|Xs],[Y|Ys],Z) ←
    Z:=(X*Y)+Z1, ip(Xs?,Ys?,Z1).
ip([ ],[ ],0).


Figure 11. Matrix multiplication. 


equations one can specify recursive pro¬ 
cess networks. 

In this section I show Concurrent Pro¬ 
log programming techniques unique to 
logic programming, as they rely on prop¬ 
erties of the logical variable. Of course, 
one can take a functional programming 
language, extend it with stream construc¬ 
tors, nondeterministic constructors, si¬ 
multaneous recursion equations, and 
logical variables, and perhaps achieve 
these techniques as well. But why approx¬ 
imate logic programming from below, in¬ 
stead of just using it? 

Incomplete messages. An incomplete 
message is a message that contains one or 
more uninstantiated variables. An in¬ 
complete message can be viewed in various 
ways, including

stack([push(X)|In],S) ←
    stack(In?,[X|S]).
stack([pop(X)|In],[X|S]) ←
    stack(In?,S).
stack([ ],[ ]).


Figure 12. A stack monitor. 


• a message being sent incrementally, 

• a message containing a communica¬ 
tion channel as an argument, 

• a message containing implicitly the 
identity of the sender, and 

• a data structure being constructed 
cooperatively. 

The first and second views are taken by 
stream processing programs. A stream is 
just a message being sent incrementally, 
and each list-cell in the stream is a message 
containing the stream variable to be used 
in the subsequent communication. Simi¬ 
larly, the processes for constructing the 
merge trees communicate by incomplete 
messages, each containing a stream of 
streams. 

However, the sender of an incomplete 
message need not be the one to complete 
it. The receiver could also complete the 
message. Two Concurrent Prolog pro¬ 
gramming techniques—monitors and 
bounded-buffers [Tak85]—operate this 
way. Monitors also take the third view, 
that an incomplete message implicitly 
holds the identity of its sender. This view 
enables rich communication patterns to be 
specified without the need for an extra 
layer of naming conventions and com¬ 
munication protocols, by providing a sim¬ 
ple mechanism for replying to a message. 

Monitors. Monitors were introduced 
into conventional concurrent program¬ 
ming languages by Hoare 9 as a technique 
for structuring the management of shared 
data. A monitor has some local data, 
which it maintains, and some procedures, 
or entries, defined for manipulating and 
examining the data. A user process that 
wants to update or inspect the data per¬ 
forms the relevant monitor call. 

The monitor has built-in synchroniza¬ 
tion mechanisms, which prevent different 
callers from updating the data simultaneously
and allow the inspection of data only
when it is in an integral state. One of the 
convenient aspects of monitors is that the 
process performing a monitor call does 
not need to identify itself explicitly. 
Rather, some of the arguments of the 
monitor call (which syntactically looks 
similar to a procedure call) serve as the 
return address for the information pro¬ 
vided by the monitor. When the monitor 
call completes, the caller can inspect these 
arguments and find there the answer to its 
query. 

Stream-based languages can mimic the 
concept of a monitor as follows 10 : A 
designated process, the monitor process, 
maintains the data to be shared. Users of 
the data have streams connected to the 
monitor by a merger. Monitor calls are 
simply messages to the monitor that up¬ 
date the data and respond to queries ac¬ 
cording to the message received. The ele¬ 
gance in this scheme is that no special 
language constructs need be added in 
order to achieve this behavior: the con¬ 
cepts already available, of processes, 
streams, and mergers, are sufficient. The 
awkward aspect of this scheme is routing 
the response back to the sender. 

Fortunately, in Concurrent Prolog in¬ 
complete messages allow responses to 
queries to be routed back to the sender 
directly, without the need for an explicit 
naming and routing mechanism. Both the 
underlying mechanism required to imple¬ 
ment incomplete messages and the result¬ 
ing effect from the user’s point of view are 
similar to conventional monitors, where a 
process that performs a monitor call finds 
the answer by inspecting the appropriate 
argument of the call, after the call is 
served. Hence, Concurrent Prolog pro¬ 
vides the convenience of monitors, while 
maintaining the elegance of stream-based 
communication. In contrast to conven¬ 
tional monitors, Concurrent Prolog moni¬ 
tors are not a special language construct, 
but simply a programming technique for 
organizing processes and data. 

The program in Figure 12 implements a 
simple stack monitor. It understands two 
messages: push(X), on which it changes
the stack contents S to [X|S], and
pop(X), to which it responds by unifying
the top element of the stack with X and
changing the stack contents to contain the
remaining stack. pop(X) is an example of
an incomplete message. 
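
The reply-through-the-message idea can be imitated in Python with a one-shot slot that travels inside the message. This is a sketch under stated assumptions, not the program of Figure 12: the Reply class is a hypothetical stand-in for the unbound variable X in pop(X), the message stream is a plain list, and messages are tuples.

```python
class Reply:
    """A one-shot slot standing in for the unbound variable X in pop(X)."""

    def __init__(self):
        self.bound = False
        self.value = None

    def bind(self, value):
        # A logical variable can be instantiated at most once.
        assert not self.bound, "reply slot is already bound"
        self.bound, self.value = True, value


def stack(messages):
    """Serve a stream of ('push', x) and ('pop', reply) messages in order."""
    contents = []
    for kind, arg in messages:
        if kind == "push":
            contents.append(arg)
        elif kind == "pop":
            # The answer returns through the message itself, so the monitor
            # never needs to know who sent it.
            arg.bind(contents.pop())
    return contents
```

A sender keeps a reference to the Reply object it put in the message and inspects it after the call is served, much as a caller of a conventional monitor inspects the arguments of the call.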


Detecting distributed termination: the 
short-circuit technique. Concurrent Pro¬ 
log does not contain a sequential-AND 
construct. Suggestions to include one were 
resisted for two reasons. First, a desire to 
keep the number of language constructs 
down to a minimum. Second, the belief 
that even if eventually such a construct 
would be needed, introducing it at an early 
stage would encourage awkward and lazy 
thinking. Instead of using Concurrent 
Prolog’s dataflow synchronization mech¬ 
anism, programmers would resort to the 
familiar sequential construct. (Early 
Prolog-in-Lisp implementations, which 
provided an easy cop-out to Lisp, had a 
similar fate. Users of these systems, 
typically experienced Lisp hackers, would 
resort to Lisp whenever they were con¬ 
fronted with a difficult programming 
problem, instead of thinking it through in 
Prolog. This led some to conclude that 
Prolog “wasn’t for real.”) 

In retrospect, this decision proved to be 
very important, both from an educational 
and an implementation point of view. 
Concurrent Prolog still does not have 
sequential-AND and Logix does not have 
the necessary underlying machinery to im¬ 
plement it, even if it were desired. The 
reason is that implementing sequential- 
AND in Concurrent Prolog on a parallel 
machine requires solving the problem of 
distributed termination detection. To run 
P&Q (assuming that & is the sequential- 
AND construct), one has to detect that P 
has terminated in order to proceed to Q. If 
P spawned many parallel processes that 
run on different processors, it requires 
detecting when all of them have termi¬ 
nated, which is a rather difficult problem 
for an implementation to solve. 

On the other hand, there is sometimes a 
need to detect when a computation ter¬ 
minates. First of all, as a service to the pro¬ 
grammer or user who wishes to know 
whether his program worked properly and 
terminated, or if it has some useful or 
useless processes still running there in the 
background. Second, when interfacing 
with the external environment there is a 
need to know whether a certain set of 
operations, such as a transaction, has 
completed in order to proceed. 

This problem can be solved using a very 
elegant Concurrent Prolog programming 
technique, called the short-circuit tech¬ 
nique, created by Takeuchi. The idea is 
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simple: Chain the processes in a certain 
computation using a circuit, where each 
active process is an open switch on the cir¬ 
cuit. When a process terminates, it closes 
the switch and shortens the circuit. When 
the entire circuit is shortened, global ter¬ 
mination is detected. 

The technique is implemented using 
logical variables, as follows: Each process 
is invoked with two variables, Left and 
Right, where the Left of one process is 
unified with the Right of another. The left¬ 
most and rightmost processes each have 
one end of the chain connected to the 
manager. The manager instantiates one 
end of the chain to some constant and 
waits until the variable at the other end is 
instantiated to that constant as well. Each 
process that terminates unifies its Left and 
Right variables. When all terminate, the 
entire chain becomes one variable and the 
manager sees the constant it sent on one 
end appearing on the other. 

An example of using the short-circuit 
technique appears in Figure 13b. 
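
The chain of unified variables can also be simulated directly. The Python sketch below models a logical variable as an object that is either unbound, bound to a constant, or aliased to another variable. Var, unify, bind, make_chain, and terminated are hypothetical names chosen for this illustration, and the manager's wait is modeled as a simple test rather than a suspended process.

```python
class Var:
    """A minimal logical variable: unbound, bound, or aliased to another Var."""

    def __init__(self):
        self.ref = None    # set when this variable is unified with another
        self.value = None  # set when this variable is instantiated

    def deref(self):
        v = self
        while v.ref is not None:
            v = v.ref
        return v


def unify(a, b):
    """Unify two variables; a terminating process does this to Left and Right."""
    a, b = a.deref(), b.deref()
    if a is b:
        return
    if a.value is None:
        a.ref = b
    elif b.value is None:
        b.ref = a
    else:
        assert a.value == b.value, "unification failure"


def bind(v, const):
    """Instantiate a variable to a constant (the manager's end of the chain)."""
    v = v.deref()
    assert v.value is None or v.value == const
    v.value = const


def make_chain(n):
    """Return (left end, right end, [(Left, Right) for each of n processes])."""
    links = [Var() for _ in range(n + 1)]
    return links[0], links[-1], [(links[i], links[i + 1]) for i in range(n)]


def terminated(right_end):
    """The manager sees 'done' at the far end only when every switch is closed."""
    return right_end.deref().value == "done"
```

The order in which processes terminate does not matter: the constant reaches the far end only after the last process has unified its two variables.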

Meta-programming and partial evalua¬ 
tion. Meta-programs are programs that 
treat other programs as data. Examples of 
meta-programs include compilers, as¬ 
semblers, and debuggers. 11 One of the 
most important and useful types of meta¬ 
programs is the meta-interpreter, some¬ 
times called a meta-circular interpreter, 
which is an interpreter for a language writ¬ 
ten in that language. 

A meta-interpreter is important from a 
theoretical point of view, as a measure for 
the quality of the language design. Designing
a language with a simple meta-interpreter
is like solving a fixpoint equation: if
the language is too complex, its meta¬ 
interpreter would be large. If it is too 
weak, it won’t have the necessary data 
structures to represent its programs and 
the control structures to simulate them. 

A language may have several meta¬ 
interpreters of different granularities. In 
logic programs, the most useful meta¬ 
interpreter is the one that simulates goal 
reduction, but relies on the underlying im¬ 
plementation to perform unification. An 
example of a Flat Concurrent Prolog 
meta-interpreter at this granularity is 
shown in Figure 13a. The meta-interpreter 
assumes that a guardless clause A ← B in
the interpreted program is represented
using the unit clause clause(A,B). If the


reduce(true).                          % halt
reduce((A,B)) ←                        % fork
    reduce(A?), reduce(B?).
reduce(A) ←                            % reduce
    A ≠ true, A ≠ (_,_) |
    clause(A?,B), reduce(B?).

(a)

reduce(A,Done) ←
    reduce1(A,done-Done).
reduce1(true,Done-Done).               % halt
reduce1((A,B),Left-Right) ←            % fork
    reduce1(A?,Left-Middle),
    reduce1(B?,Middle-Right).
reduce1(A,Left-Right) ←                % reduce
    A ≠ true, A ≠ (_,_) |
    clause(A?,B), reduce1(B?,Left-Right).

(b)


body of the clause is empty, then B = true.
A guarded clause A ← G | B is represented
by clause(A,B) ← G | true.

The plain meta-interpreter is interesting 
mostly for a theoretical reason, as it does 
nothing except simulate the program being 
executed. However, slight variations on it
result in meta-interpreters with very useful 
functionalities. For example, by extending 
it with a short circuit, as in Figure 13b, a
termination-detecting meta-interpreter is 
obtained. 
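
The halt, fork, and reduce cases of the plain meta-interpreter can be imitated in Python for goals without arguments. This is a deliberately reduced sketch: there is no unification and there are no guards, so a dictionary lookup stands in for clause(A,B), and the pending list stands in for the pool of concurrent processes. The goal encoding and the reduction counter are assumptions of the sketch.

```python
def reduce(goal, clauses):
    """Reduce a goal to completion and return the number of clause reductions.

    A goal is "true", a pair (A, B) for a conjunction, or an atom name.
    clauses maps each atom to its body, playing the role of clause(A, B).
    """
    pending, reductions = [goal], 0
    while pending:                    # processes that still have to run
        g = pending.pop()
        if g == "true":               # halt
            continue
        if isinstance(g, tuple):      # fork
            a, b = g
            pending += [a, b]
        else:                         # reduce: replace the goal by its body
            pending.append(clauses[g])
            reductions += 1
    return reductions
```

In this sequential setting termination is simply the pool becoming empty. On a parallel machine no single process observes the pool, which is why the termination-detecting variant threads a short circuit through the computation instead.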

Many other important functions can be 
implemented by enhanced meta-interpreters
[SafSh86]. In Prolog, they have been
used to implement explanation facilities 
for expert systems. In compiler-based Pro¬ 
log systems, as well as in Logix, the debug¬ 
ger is based on an enhanced meta-inter¬ 
preter, and layers of protection and con¬ 
trol are defined through meta-interpreters 
[Hir86]. Such meta-interpreters, including 
abortable, interruptible, failsafe, and 
deadlock-detecting meta-interpreters, are 
shown and explained elsewhere. One 
problem with using such meta-interpreters 
directly is the execution overhead of the 
added layer of interpretation, which is un¬ 
acceptable in many applications. 

Partial evaluation, a program-transfor¬ 
mation technique, can eliminate the over¬ 
head of meta-interpreters [SafSh86]. 12 In 
effect, partial evaluation can turn en¬ 
hanced meta-interpreters into compilers, 
which produce as output the input pro¬ 
gram enhanced with the functionality of 
the meta-interpreter.

Modular programming and program¬ 
ming in the large. The techniques shown 


Figure 13. A plain meta-interpreter for Flat 
Concurrent Prolog (a) and a termination 
detecting meta-interpreter (b). 






above refer mostly to programming in the 
small. This does not mean that Concurrent 
Prolog is not suitable for programming in 
the large. On the contrary, I found that 
even using the simple module system de¬ 
veloped for bootstrapping Logix, many 
people could cooperate in its develop¬ 
ment. I expect the situation to improve 
further using the hierarchical module 
system, currently under development. 

The key idea in these module systems, 
which are implemented entirely in Concur¬ 
rent Prolog, is to use Concurrent Prolog 
message-passing to implement inter¬ 
module calls. This means that no addi¬ 
tional communication mechanism is 
needed to support remote procedure calls 
between modules which reside on different 
processors. 

The development of 
Concurrent Prolog 

Concurrent Prolog was conceived and 
first implemented in November 1982, in an 
attempt to extend Prolog to a concurrent 
programming language and to clean up 
and generalize the Relational Language of 
Clark and Gregory [ClGr81]. Although 
one of the goals of the language was to be a 
superset of sequential Prolog, the propos¬ 
ed design did not seem, on the face of it, to 
achieve this goal, hence was termed a 
subset of Concurrent Prolog [Sha83]. 

A major strength of that language, 
which later became known simply as Con¬ 
current Prolog, was that it had a working, 
usable, implementation: an interpreter 
written in Prolog [Sha83]. Since the concepts
of the language were quite radical at
the time, it seemed fruitful to try and ex¬ 
plore them experimentally, by writing pro¬ 
grams in the language, rather than to get 
involved in premature arguments on lan¬ 
guage constructs, or to implement the 
language “for real” before its concepts 
were explored and understood, or to 
extend this “language subset” prema¬ 
turely, before its true limitations were 
encountered. 

In this respect the development of Con¬ 
current Prolog deviated from the common 
practice of research on a new program¬ 
ming language. This typically concentrates 
on theoretical aspects of the language 
definition (as for CCS 13 ), or attempts to 
construct an efficient implementation of it 
(as for Pascal), but rarely focuses on ac¬ 
tual usage of the language through a 
prototype implementation. 

This exploratory activity proved tre¬ 
mendously useful. Novel ways of using 
logic as a programming language were 
unveiled [Sha83], [ShaTak83], and tech¬ 
niques for incorporating conventional 
concepts of concurrent programming in 
logic were developed [Sha86], [ShMie84].
Most importantly, a large body of working
Concurrent Prolog programs that
solve a wide range of problems and implement
many types of algorithms was
gathered. This activity, which continued
for a period of about two years, mostly at 
ICOT and at the Weizmann Institute, 
resulted in papers on “How to do X in 
Concurrent Prolog” for numerous X’s.

A programming language cannot be 
general purpose if only a handful of ex¬ 
perts can grasp it and use it effectively. To 
investigate how easy Concurrent Prolog is 
to learn, I have taught Concurrent Prolog 
programming courses at the Weizmann 
Institute and at the Hebrew University at 
Jerusalem. Altogether, about 90 graduate 
and 100 undergraduate students in com¬ 
puter science have attended these courses. 
Based on performance in programming 
assignments and on the quality of the 
course’s final programming projects, it 
seems that more than three-quarters of the 
students became effective Concurrent 
Prolog programmers. 

Accumulated experience suggested that 
Concurrent Prolog would be an expressive 
and productive general-purpose program¬ 
ming language, if implemented efficiently. 
The strength of the language was perceived
mostly in systems programming
and in the implementation of parallel and 
distributed algorithms. It also seemed 
suitable for the implementation of knowl¬ 
edge-programming tools for artificial in¬ 
telligence applications and as a system- 
description and simulation language. 

The next step was to try and develop an 
efficient implementation of the language 
on a uniprocessor, to serve as a building 
block for a parallel implementation and 
as a tool for exploring and testing the ap¬ 
plicability of the language further. This 
proved surprisingly difficult. Interpreters 
for the language developed at the Weiz¬ 
mann Institute exhibited miserable perfor¬ 
mance. A compiler of Concurrent Prolog 
on top of Prolog was developed at ICOT 
[UedChi85]. Although the latest version 
of the compiler reached a speed of more 
than 10,000 reductions per second, which 
is more than a quarter of the speed of the 
underlying Prolog system on that 
machine, it did not scale to large applica¬ 
tions because it employed busy waiting. 

In addition to the implementation dif¬ 
ficulties, subtle problems and opacities in 
the definition of the OR-parallel aspect of 
Concurrent Prolog were uncovered. 

As a result of these difficulties, we 
decided to switch research direction and 
concentrate our implementation effort on 
Flat Concurrent Prolog, the AND-parallel 
subset of Concurrent Prolog. Flat Con¬ 
current Prolog was a legitimate subset of 
Concurrent Prolog for two reasons. First, 
it has a simple meta-interpreter, shown 
above as Figure 13. Second, we discovered 
that almost all the applications written in 
Concurrent Prolog previously are either in 
its Flat subset already, or can be easily 
hand-converted into it. This demonstrated 
the utility of having a large body of Con¬ 
current Prolog code. Without it, we would 
not have had the courage to make what 
seemed to be such a drastic cut in the 
language. 

There was one Concurrent Prolog pro¬ 
gram that would not translate into Flat 
Concurrent Prolog easily: an OR-parallel 
Prolog interpreter. This four-clause pro¬ 
gram, written by Ken Kahn and shown in 
Figure 14, was simultaneously the final 
victory of Concurrent Prolog and its 
deathblow. It was a victory for the prag¬ 
matic expressiveness of Concurrent Pro¬ 
log, since it showed that without extending 
the original subset of Concurrent Prolog,
the language was as expressive as Prolog:
any pure Prolog program can run on a 
Concurrent Prolog machine (with OR- 
parallelism for free) by adding to it the 
four clauses of Kahn’s interpreter. Thus 
the original design goal of Concurrent 
Prolog—to have a concurrent program¬ 
ming language that includes Prolog—was 
actually achieved, though it took more 
than a year for us to realize it. 

It was a deathblow to the implementability
of Concurrent Prolog, at least for
the time being, since it showed that im¬ 
plementing Concurrent Prolog efficiently 
is as hard as, and probably harder than, 
implementing OR-parallel Prolog. As we 
all know, no one knows how to implement 
OR-parallel Prolog efficiently, as yet. 

Once the switch to Flat Concurrent Pro¬ 
log was made, in June 1984, implementa¬ 
tion work progressed rapidly. A simple 
interpreter for the language was imple¬ 
mented in Pascal [Mie85]. An abstract in¬ 
struction set for Flat Concurrent Prolog 
was designed, based on the Warren In¬ 
struction Set for unification 5 and the 
abstract machine embodied in the FCP 
interpreter [Hou86], and an initial version 
of the compiler was written in Flat Con¬ 
current Prolog. 

In July 1985, the bootstrapping of this 
compiler-based system was completed. 
The system, called Logix [Sil86], is a 
single-user multi-tasking program devel¬ 
opment environment. It consists of 

• a five-pass compiler, including a
tokenizer, parser, preprocessor,
encoder, and assembler;

• an interactive shell, which includes a
command-line editor and supports
management and inspection of multiple
parallel computations;

• a source-level debugger, based on a
meta-interpreter;

• a module system that supports
separate compilation, runtime linking,
and a free mixing of interpreted
(debuggable) and compiled
modules;

• a tty-controller, which allows multiple
parallel processes, including the
interactive shell, to interact with the
user in a consistent way;

• a simple file-server, which interfaces
to the Unix file system; and

• some input, output, profiling,
style-checking, and other utilities.

The system is written in Flat Concurrent
Prolog. Its source is about 10,000 lines of
code long, divided among 45 modules.
About half of it is the compiler.

The system uses no side-effects or other 
extra-logical constructs, except in a few 
well-defined places. In the interface to the 
physical devices, low-level kernels make 
the keyboard and screen look like Concur¬ 
rent Prolog input and output streams of 
bytes, and the Unix file system looks like a 
Concurrent Prolog monitor that main¬ 
tains an association table of (FileName, 
FileContents). In the multiway stream 
merger and distributer, heavily used by the 
rest of the system, destructive assign¬ 
ment is used to achieve constant delay 
[ShSaf86], compared with the logarithmic 
delay that can be achieved in pure Concur¬ 
rent Prolog [ShMie84]. 

The other part of the system, written in 
C, includes an emulator of the abstract 
machine, an implementation of the ker¬ 
nels, and a stop-and-copy garbage collec¬ 
tor [Hou86]. It is about 6000 lines of code 
long. When compiled on the DEC VAX, 
the emulator occupies about 60K bytes, 
and Logix another 300K bytes. (At the 
moment we use word encoding, rather 
than byte encoding, for the abstract 
machine instructions.) When idle, Logix 
consists of about 750 Concurrent Prolog 
processes. Logix itself is running as one 
Unix process. 

The compiler compiles about 100 source 
lines per CPU minute on a VAX 11/750. A 
run of the compiler on the encoder, which 
is about 400 lines long, creates about 
31,000 temporary Concurrent Prolog pro¬ 
cesses and generates about 1.5M bytes of 
temporary data structures (garbage). Dur¬ 
ing this computation about 90,000 process 
reductions occur and 10,000 process 
suspensions/activations. 

Overall, the system achieves at present 
about a fifth to a quarter of the speed of 
Quintus Prolog, which is the fastest com¬ 
mercially available Prolog on the VAX 
today. The number is obtained by com¬ 
paring Concurrent Prolog process reduc¬ 
tions to Prolog procedure calls for the 
same logic programs. This indicates that 
the efficiency of Warren’s abstract Prolog 
machine 5 (the basis of Quintus Prolog) is 
about the same as our Flat Concurrent 
Prolog machine. The gap can be closed by 
rewriting our emulator in assembly language,
as Quintus does. To explain this

solve([ ]).
solve([A|As]) ←
    clauses(A,Cs),
    resolve(A?,Cs?,As?).
resolve(A,[(A ← Bs)|Cs],As) ←
    append(Bs?,As?,ABs),
    solve(ABs?) | true.
resolve(A,[C|Cs],As) ←
    resolve(A?,Cs?,As?) | true.
append(Xs,Ys,Zs) ← (See Figure 6).
clauses(A,Cs) ← Cs is the list of clauses in A’s procedure.

Figure 14. Kahn’s OR- 
parallel Prolog interpreter. 


similarity in performance, recall that 
although Flat Concurrent Prolog needs to 
create and maintain processes, which is a 
bit more expensive than creating stack 
frames for Prolog procedure calls, it does 
not support deep backtracking, where 
Prolog does and pays dearly for it. 

Efforts at ICOT and 
Imperial College: GHC 
and Parlog 

In the meantime ICOT did not stand 
still. Given their decision to use Concur¬ 
rent Prolog as the basis for Kernel 
Language 1, 14 the core programming lan¬ 
guage of their planned Parallel Inference 
Machine, they also attempted to imple¬ 
ment its OR-parallel aspect. Prototype im¬ 
plementations of three different schemes 
were constructed, namely shallow binding 
[Miy85], deep binding, and lazy copying 
(the scheme we tried at Weizmann). Shal¬ 
low binding proved to be the fastest, but 
did not seem to scale to multiprocessors. 
Lazy copying was the slowest, so the 
choice seemed to fall on deep binding. Un¬ 
fortunately the implementation scheme 
was rather complex, and the subtle pro¬ 
blems with Concurrent Prolog’s OR- 
parallelism remained unsolved. On the 
other hand, ICOT did not want to follow 
the Flat Concurrent Prolog path, since it 
seemed to take them even further away 
from Prolog and from the AI applications 
envisioned for the Parallel Inference 
Machine. 

An elegant solution to these problems 
was found in Guarded Horn Clauses 
[Ued85], a novel concurrent logic programming
language. The main design
choice of GHC was to eliminate multiple
OR-parallel environments from Concur¬ 
rent Prolog. Besides avoiding a major im¬ 
plementation problem, this decision also 
provided a synchronization rule: If you try 
to write on the parent environment, then 
suspend. (In Concurrent Prolog, a process 
would allocate a local copy of the variable 
and continue instead.) This rule made the 
read-only annotation somewhat super¬ 
fluous. The resulting language exhibits 
elegance and conciseness, and seems to 
capture most of Concurrent Prolog’s ap¬ 
plications and programming techniques, 
excluding, of course, Kahn’s OR-parallel 
Prolog interpreter. GHC is ICOT’s cur¬ 
rent choice for Kernel Language 1. Besides 
solving some of the difficulties in the 
definition and implementation of Concur¬ 
rent Prolog, GHC is “Made in Japan,” 
certainly not a disadvantage from ICOT’s 
point of view. Recent implementation ef¬ 
forts at ICOT concentrate on Flat GHC, 
the GHC analog to Flat Concurrent 
Prolog. 

So why didn’t we switch to GHC? Long 
discussions were carried out among our 
group about this option. Our general con¬ 
clusion was that even though GHC is a 
simpler formalism, it is also more fragile, 
less expressive, and more difficult to ex¬ 
tend. We felt it would either break or lose 
much of its elegance when faced with the 
problems of implementing a real operating 
system, which includes a secure kernel, 
error handling for user programs, and dis¬ 
tributed termination and deadlock detec¬ 
tion. Furthermore, it would be less ade¬ 
quate for AI applications, since it has a 
weaker notion of unification. 

Another related research effort is the
development of the Parlog programming
language by Clark and Gregory at Imperial
College [ClGr86]. Parlog is compiler-
oriented, even more than GHC, in a way
that seems to render it unsuitable for
meta-programming. Given our commitment to
implement the entire programming en¬ 
vironment and operating system around 
the concepts of meta-interpretation and 
partial evaluation, we cannot use Parlog. 
On the performance side, Parlog and 
GHC seem quite similar, except that GHC 
has to make a runtime check that guards 
do not write on the parent’s environment, 
whereas Parlog ensures this at compile 
time, using what is called a safety check. 
On the expressiveness side, there does not
seem to be a great difference between
Parlog and GHC, except for
meta-programming.

Alternative synchronization constructs 
to the read-only variable were proposed by 
Saraswat [Sar85] and by Ramakrishnan 
and Silberschatz [Ram86]. 


Current research 
directions 

The main focus of our current research 
at the Weizmann Institute is the im¬ 
plementation of a Concurrent Prolog- 
based general-purpose parallel computer 
system. Our present implementation vehi¬ 
cle is Intel’s iPSC d4/me, a memory- 
enhanced four-dimensional hypercube, 
which, incidentally, is isomorphic to a 
4x4 mesh-connected torus. As a first 
step, a distributed FCP interpreter was im¬ 
plemented in C, based on a distributed 
unification algorithm that guarantees the 
atomicity of goal reductions [Tay85]. 
Also, a technique for implementing Con¬ 
current Prolog virtual machines that 
manage code and process mapping on top 
of the physical machine has been 
developed [Tay86]. 
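
The isomorphism between the four-dimensional hypercube and the 4x4 torus can be checked mechanically. The following Python sketch is an illustration only and is not part of the FCP implementation; the function names are hypothetical, and the Gray-code embedding is one standard choice that we assume here, not something taken from the iPSC documentation.

```python
from itertools import product

def gray(i: int) -> int:
    """Reflected binary Gray code of i."""
    return i ^ (i >> 1)

def torus_to_cube(row: int, col: int) -> int:
    """Map node (row, col) of the 4x4 torus to a 4-bit hypercube address."""
    return (gray(row) << 2) | gray(col)

def torus_neighbors(row: int, col: int) -> set:
    """The four wrap-around neighbors of (row, col) on a 4x4 torus."""
    return {((row + 1) % 4, col), ((row - 1) % 4, col),
            (row, (col + 1) % 4), (row, (col - 1) % 4)}

def cube_neighbors(node: int) -> set:
    """The four hypercube neighbors of node: flip one address bit."""
    return {node ^ (1 << bit) for bit in range(4)}

def is_isomorphism() -> bool:
    """True if torus_to_cube is a bijection that preserves adjacency."""
    nodes = list(product(range(4), repeat=2))
    if {torus_to_cube(r, c) for r, c in nodes} != set(range(16)):
        return False
    return all(
        {torus_to_cube(*n) for n in torus_neighbors(r, c)}
        == cube_neighbors(torus_to_cube(r, c))
        for r, c in nodes
    )
```

The check works because a cycle of four nodes is itself a two-dimensional hypercube, so each torus axis accounts for two of the four address bits.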

Since Logix is self-contained, once the 
abstract FCP machine runs on a parallel 
computer, an entire program development 
environment and operating system will 
also become available on it. For example, 
the Logix source-level debugger, as well as 
other meta-interpreter-based tools such as 
a profiler, would preserve the parallelism 
of the interpreted program while executing 
on a parallel computer. So, with this sys¬ 
tem a parallel computer could be used 
both as the development machine and as 


the target machine, which is clearly advan¬ 
tageous over the sequential front-end/ 
parallel back-end machine approach. 
Since source text, parsed code, and com¬ 
piled code are first-class objects in Logix, 
routines that implement code-manage¬ 
ment algorithms on the parallel computer 
are written in Concurrent Prolog itself 
[Tay86]. 

A technique for compiling Concurrent 
Prolog into Flat Concurrent Prolog was 
developed [Cod86]. It involves writing a 
Concurrent Prolog interpreter in Flat 
Concurrent Prolog and then partially 
evaluating it 15 with respect to the program 
to be compiled. It avoids the dynamic 
multiple-environment problem by requir¬ 
ing static output annotations on variables 
to be written upon. An attempt to provide 
Concurrent Prolog with precise semantics 
is also being made, following initial work 
by Levi and Palamidessi [Lev85] and 
Saraswat [Sar85].

Another research direction pursued is 
partial evaluation [SafSh86], a technique 
of program transformation and optimiza¬ 
tion that proves very versatile when com¬ 
bined with heavy usage of interpreters and 
meta-interpreters [Hir86], [Sil86], as in 
Logix. 

We believe that parallel execution is not 
a substitute for, but rather is dependent 
upon, efficient uniprocessor implementa¬ 
tion. To that effect, a high-performance 
FCP compiler is being developed. Hand 
timings indicate expected performance of 
about 30K LIPS for a 10-MHz 68010. 

Logix itself is still under development. 
Short term extensions include a hierar¬ 
chical module system and a window sys¬ 
tem. Longer term research includes ex¬ 
tending it to a multiprocessor/multiuser 
operating system. 

Our research on Concurrent Prolog
has demonstrated that a high-level
logic programming language
can express conveniently a wide
range of parallel algorithms. The perfor¬ 
mance of the Logix system demonstrates 
that a side-effect-free language based on 
lightweight processes can be practical even 
on conventional uniprocessors. It thus de¬ 
bunks the expensive process spawn myth. 
Its functionality and pace of development 
testify that Concurrent Prolog is a usable 
and productive systems programming 
language. 


We have yet to demonstrate the prac¬ 
ticality of Concurrent Prolog for pro¬ 
gramming parallel computers. We find the 
ultimate and most important question to 
be: Which of the currently proposed ap¬ 
proaches will result in a scalable parallel 
computer system whose generality of ap¬ 
plications, ease of use, and cost/perfor¬ 
mance ratio in terms of both hardware and 
software can compete favorably with ex¬ 
isting sequential computers? 

Until such a system is demonstrated, the 
question of parallel processing cannot be 
considered solved. □ 
rithm underlying the FCP implementation on
the iPSC hypercube. 

[Tam85] 

Tamaki, H., “A Distributed Unification 
Scheme for Systolic Logic Programs,” Proc. of 
the 1985 Internat’l Conf. on Parallel Process¬ 
ing, IEEE, 1985, pp. 552-559. 

A unification algorithm for a distributed 
machine. Supports only a limited class of con¬ 
current logic programs. 

[UedChi85] 

Ueda, K., and T. Chikayama, “Concurrent 
Prolog Compiler on Top of Prolog,” 1985 
Symp. on Logic Programming, IEEE Com¬ 
puter Society, July 1985, pp. 119-126. 

A compiler based on the interpretation algo¬ 
rithm of Shapiro’s “A Subset of Concurrent 
Prolog and its Interpreter” ([Sha83]), which 
achieves an impressive performance. 

Concurrent Prolog applications 
[Ede85] 

Edelman, S., and E. Shapiro, “Quadtrees in 
Concurrent Prolog,” Proc. of the Internat’l 


Conf. on Parallel Processing, IEEE Computer 
Society, Aug. 1985, pp. 544-551. 

Describes quad-tree-based algorithms and
their implementation in Concurrent Prolog. 

[Fur84] 

Furukawa, K., et al., “Mandala: A Logic- 
based Knowledge Programming System,” 
Proc. of FGCS ’84, Tokyo, Japan, 1984, pp.
613-622. 

A LOOPS-like object-oriented system and 
its implementation in Concurrent Prolog. 

[Hel84] 

Hellerstein, L., and E. Shapiro, “Implementing
Parallel Algorithms in Concurrent Prolog:
The MAXFLOW Experience,” Proc. of the
Internat’l Symp. on Logic Programming,
Atlantic City, New Jersey, Feb. 1984.

A complicated PRAM algorithm is made 
single-assignment without loss of efficiency and 
implemented in Concurrent Prolog. 

[Suz86] 

Suzuki, N., “Experience with Specification and
Verification of Complex Computer Hardware
Using Concurrent Prolog,” Logic Programming
and its Applications, eds. D. H. D. Warren
and M. van Caneghem, Ablex, 1986.

A proposal for using Concurrent Prolog as 
an executable hardware description language. 
The results compare favorably with the 
author’s previous experience with a more con¬ 
ventional formalism he has developed. 

[Tayl86] 

Taylor, S., et al., “Notes on the Complexity of 
Systolic Programs,” Weizmann Institute 
technical report CS86-16, 1986. 

The complexity of several Concurrent Prolog 
implementations of systolic algorithms is 
analyzed and standard techniques for improv¬ 
ing communication complexity are identified. 


The importance of parallel computing
hardly needs emphasis. Many physi¬ 
cal problems and abstract models 
are seriously compute-bound, since se¬ 
quential computer technology now faces 
seemingly insurmountable physical limita¬ 
tions. It is widely believed that the only 
feasible path toward higher performance 
is to consider radically different computer 
organizations, in particular ones ex¬ 
ploiting parallelism. This argument is 
indeed rather old now, and considerable 
progress has been made in the construc¬ 
tion of highly parallel computers. 

One of the simplest and most promising 
types of parallel machines is the well- 
known multiprocessor architecture, a col¬ 
lection of autonomous processors with 
either shared or distributed memory that 
are interconnected by a homogeneous 
communications network and usually 
communicate by sending messages. The 
interest in machines of this type is not sur¬ 
prising, since not only do they avoid the 
classic “von Neumann bottleneck” by 
being effectively decentralized, but they 
are also extensible and in general quite 
easy to build. Indeed, more than a dozen 
commercial multiprocessors either are 
now or will soon be available. 

Although designing and building
multiprocessors has proceeded at a dramatic
pace, the development of effective ways to
program them has generally not. This is an 
unfortunate state of affairs, since ex¬ 
perience with sequential machines tells us 
that software development, not hardware 
development, is the most critical element 
in a system’s design. The immense com¬ 
plexity of parallel computation can only 
increase our dependence on software. 
Clearly we need effective ways to program 
the new generation of parallel machines. 

In this article I introduce para-function¬ 
al programming, a methodology for 
programming multiprocessor computing 
systems. It is based on a functional pro¬ 
gramming model augmented with features 
that allow programs to be mapped to 
specific multiprocessor topologies. The 
most significant aspect of the methodol¬ 
ogy is that it treats the multiprocessor as a 
single autonomous computer onto which a 
program is mapped, rather than as a group 
of independent processors that carry out 
complex communication and require com¬ 
plex synchronization. In more conven¬ 
tional approaches to parallel program¬ 
ming, the latter method of treatment is 
often manifested as processes that co¬ 
operate by message-passing. However, 
such notions are absent in para-functional 
programming; indeed, a single language 
and evaluation model can be used from
problem inception, to prototypes targeted
for uniprocessors, and ultimately to
realizations on a parallel machine.

Functional 
programming and 
parallel computing 

The future of parallel computing de¬ 
pends on the creation of simple but effec¬ 
tive parallel-programming models (re¬ 
flected in appropriate language designs) 
that make the details of the underlying 
architecture transparent to the user. Many 
researchers feel that conventional im¬ 
perative languages are inadequate for such 
models, since these languages are intrin¬ 
sically tied to the “word-at-a-time” von 
Neumann machine model. 1 Extending 
such a sequential model to the parallel 
world is like putting on a shoe that doesn’t 
fit. It makes more sense to use a language 
with a nonsequential semantic base. 

One of the better candidates for parallel 
computing is the class of functional lan¬ 
guages (also known as applicative or 
dataflow languages). In a functional lan¬ 
guage, no side effects (such as those 
caused by an assignment statement) are 
permitted. The lack of side effects ac¬ 
counts at least partially for the well-known 
Church-Rosser Property, which essential¬ 
ly states that no matter what order of com¬ 
putation is chosen in executing a program, 
the program is guaranteed to give the same 
result (assuming termination). This mar¬ 
velous determinacy property is invaluable 
in parallel systems. It means that programs 
can be written and debugged in a func¬ 
tional language on a sequential machine, 
and then the same programs can be exe¬ 
cuted on a parallel machine for improved 
performance. The key point is that in 
functional languages the parallelism is im¬ 
plicit and supported by their underlying 
semantics. There is generally no need for 
special message-passing constructs or 
other communications primitives, no need 
for synchronization primitives, and no 
need for special “parallel” constructs such 
as “parbegin...parend.” 

On the other hand, doing without as¬ 
signment statements seems rather radical. 
Yet clearly the assignment statement is an 
artifact of the von Neumann computer 
model and is not essential to the most 
abstract form of computation. In fact, a 


major goal of high-level language design 
has been the introduction of expressions, 
which transfer the burden of generating 
sequential code involving assignments 
from programmer to compiler. Functional 
languages simply carry this goal to the ex¬ 
treme: Everything is an expression. The 
advantages of the resulting programming 
style have been well-argued elsewhere, 1-2 
and will not be repeated here. However, I 
wish to emphasize the following point: 
Although most experienced programmers 
recognize the importance of minimizing 
side effects, the importance of doing so in 
a parallel system is intensified significant¬ 
ly, due to the careful synchronization re¬ 
quired to ensure correct behavior when 
side effects are present. Without side 
effects, there is no way for concurrent por¬ 
tions of a program to affect one another 
adversely—this is simply another way of 
stating the Church-Rosser Property. 

The use of functional languages for par¬ 
allel programming is really nothing new. 
Such use has its roots in early work on 
dataflow and reduction machines, in the 
course of which many functional lan¬ 
guages were developed simultaneously 
with the design of new parallel architec¬ 
tures. Consider, for example, J. B. Den¬ 
nis’s dataflow machine and the language 
VAL, Arvind’s U-interpreter and the lan¬ 
guage ID, A. L. Davis’s dataflow machine 
DDM1 and the language DDN, and R. M. 
Keller’s reduction machine AMPS and the 
language FGL. Such work on automati¬ 
cally decomposing a functional program 
for parallel execution continues today, and 
includes Rediflow 3 and my own work on 
serial combinators. 4 

The aforementioned systems automati¬ 
cally extract parallelism from a program 
and dynamically allocate the resultant 
tasks for parallel execution. But what 
about a somewhat different scenario— 
one in which the programmer knows the 
optimal mapping of his or her program 
onto a particular multiprocessor? One 
cannot expect an automated system to 
determine this optimal mapping for all 
program-processor combinations, so it is 
desirable to provide the user with the abili¬ 
ty to express the mapping explicitly. (The 
need for this ability often arises, for exam¬ 
ple, in scientific computing, where many 
classic algorithms have been redesigned
for optimal performance on particular
machines.) As it stands, almost no
languages provide this capability.

ParAlfl is a functional language that 
provides a simple yet powerful mechanism 
for mapping a program onto an arbitrary 
multiprocessor. The mapping is ac¬ 
complished by annotating subexpressions 
so as to show the processor on which they 
will be executed. With annotations, the 
mapping can be done in such a way that 
the program’s functional behavior is not 
altered; that is, the program itself re¬ 
mains unchanged. The resulting method¬ 
ology is referred to as para-functional 
programming, since it provides not only a 
much-needed tool for expressing parallel 
computation, but also an operational se¬ 
mantics that is truly “extra,” or “beyond” 
the functional semantics of the program. 
It is quite powerful, for several reasons: 

• It is very flexible. Not only is para- 
functional programming easily adapted to 
any functional language, but also any net¬ 
work topology can be captured by the 
notation, since no a priori assumptions are 
made about the structure of the physical 
system. All the benefits of conventional 
scoping disciplines are available to create 
modular programs that conform to the 
topology of a given machine. 

• The annotations are natural and con¬ 
cise. There are no special control con¬ 
structs, no message-passing constructs, 
and in general no forms of “excess bag¬ 
gage” to express the rather simple notion 
of where and when to compute things. 

• With some minor constraints, if a 
para-functional program is stripped of its 
annotations, it is still a perfectly valid 
functional program. This means that it 
can be written and debugged on a unipro¬ 
cessor that ignores the annotations, and 
then executed on a parallel processor for 
increased performance. Portability is 
enhanced, since only the annotations need 
to change when one moves from one par¬ 
allel topology to another (unless the algo¬ 
rithm itself changes). The ability to debug 
a program independently of the parallel 
machinery is invaluable. 

ParAlfl: a simple 
para-functional 
programming language 

ParAlfl forms the testbed of Yale’s 
para-functional programming research. It 
was derived from a functional language 
called ALFL, 5 which is similar in style to 
several modern functional languages, in¬ 
cluding SASL (and its successors KRC and 
Miranda), 6 FEL, 7 and Lazy ML. 8 

The base language. To make ParAlfl ac¬ 
cessible to a broader audience, the base 
language, as shown here, was changed 
somewhat; for example, the arguments in 
function calls are “tupled” rather than 
“curried.” The interested reader is re¬ 
ferred to Henderson 2 and to Darlington, 
Henderson, and Turner 9 for a more 
thorough treatment of the functional pro¬ 
gramming paradigm. The salient features 
of the base language are 

• Block-structuring is used, and takes 
the form of an equation group with the 
following configuration: 

{ f1(x1,...,xk1) == exp1;
  f2(x1,...,xk2) == exp2;
  ...
  result exp;
  ...
  fn(x1,...,xkn) == expn }


An equation group is simply a collection 
of mutually recursive equations (each 
defining a local identifier) together with a 
single result clause that expresses the value 
to which the equation group will evaluate 
(result is a reserved word). Equation
groups are just expressions, and can thus 
be nested to an arbitrary depth. 

• A double equal-sign (“==”) is used
to distinguish equations from Boolean
expressions of the form exp1 = exp2. The
argument list is optional, allowing
definitions of simple values, such as x == exp.
Since the equations are mutually recur¬ 
sive, and since ParAlfl is a lazy functional 
language, the order of the equations is 
irrelevant. All values are essentially eval¬ 
uated “on demand.” 

• As in Lisp, the list is a fundamental
data structure in ParAlfl. The operators
^, ^^, hd, and tl are like cons, append,
car, and cdr, respectively, in Lisp. ^ and
^^ build lists: a^l is the list whose first
element is a and the rest is just the list l, and
l1^^l2 is the list resulting from appending
the lists l1 and l2 together. Hd and tl
decompose lists: hd(a^l) returns a, and
tl(a^l) returns l. A proper list (one ending
in “nil”) can be constructed with square
brackets, as in [a,b,c], which is equivalent
to a^b^c^[] (^ is right associative). Lists
are constructed lazily.

• ParAlfl has functional arrays. The
equation v == mka(d,f) (mka is short
for “make array”) defines a vector v of d
values, indexed from 1 to d, such that the
ith element v[i] is the same as f(i).
Generally, the equation a == mka(d1,d2,
...,dn,f) defines an n-dimensional array a
such that a[i1,...,in] = f(i1,...,in).
Arrays are constructed lazily, although the
elements are computed in parallel. (See the
section entitled “Eager Expressions,”
below.) In an earlier article on
para-functional programming 10 arrays were
defined as being non-lazy, or strict. In
reality, both kinds of array construction
are provided in ParAlfl.
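
The behavior of mka can be sketched in Python as follows. This is an assumption-laden model, not ParAlfl's implementation: it builds the strict variant of the array, computes the elements sequentially, and represents the array as a dictionary keyed by index tuples (so a vector element v[i] is written v[(i,)]).

```python
from itertools import product

def mka(*args):
    """mka(d1, ..., dn, f): an n-dimensional array indexed from 1,
    whose element at (i1, ..., in) is f(i1, ..., in)."""
    *dims, f = args
    index_ranges = (range(1, d + 1) for d in dims)
    return {idx: f(*idx) for idx in product(*index_ranges)}

# A vector of the first five squares, indexed from 1 to 5.
v = mka(5, lambda i: i * i)

# A 2 x 3 array whose element at (i, j) is 10*i + j.
a = mka(2, 3, lambda i, j: 10 * i + j)
```

Since f is a pure function, the elements are independent of one another, which is what allows ParAlfl to compute them in parallel.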

An important semantic feature of
ParAlfl is lazy evaluation.* That is,
expressions are evaluated on demand instead
of according to some syntactic rule, such
as the order of identifier bindings.

*Lazy evaluation is closely related to the call-by-name
semantics of Algol, but is different in that once an
expression is computed, its value is retained. In function
calls, lazy evaluation is sometimes referred to as
call-by-need evaluation.

For example, one can write



{ a == b*b;
  b == ...;
  result f(a,b);
  f(x,y) == if p then y else x+y }
Note that a depends on b, yet is defined 
before b. Indeed, the order of these equa¬ 
tions and the result clause is totally irrele¬ 
vant. Note further that the function/does 
not use its first argument if p is true. Thus, 
in the call f(a,b ), the argument a is never 
evaluated (that is, the multiplication b*b 
never happens) if p is true. 
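
Call-by-need can be modeled in Python with a memoizing thunk. The sketch below is an illustration under stated assumptions: the class Thunk and the function run are hypothetical names, and the value 7 chosen for b is arbitrary. It mirrors the example discussed above, in which a is defined as b*b and passed to f.

```python
class Thunk:
    """A delayed computation that runs at most once (call-by-need)."""
    def __init__(self, compute):
        self._compute = compute
        self._done = False
        self._value = None

    def force(self):
        if not self._done:
            self._value = self._compute()
            self._done = True
            self._compute = None      # the retained value replaces the code
        return self._value

def run(p: bool):
    """Return (value of f(a, b), log of multiplications performed)."""
    log = []

    def multiply():
        log.append("b*b")
        return b.force() * b.force()

    a = Thunk(multiply)               # a is defined before b, as in the text
    b = Thunk(lambda: 7)              # assumed value, for illustration only

    def f(x, y):
        return y.force() if p else x.force() + y.force()

    return f(a, b), log
```

With p true, run returns (7, []), so the multiplication never happens; with p false it returns (56, ["b*b"]).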

An often highlighted feature of lazy 
evaluation is its ability to express un¬ 
bounded data structures, or infinite lists. 
For example, an infinite list of the squares 
of the natural numbers can be defined by 

{ result squares(0);
  squares(n) == n*n ^ squares(n+1) }
However, an important but often over¬ 
looked advantage of lazy evaluation is 
simply that it frees the programmer from 
extraneous concerns about the order of 
evaluation of expressions. Being freed 
from such concerns is very liberating for 
programming in general, but is especially 
important in parallel programming 
because over-specifying the order of eval¬ 
uation can limit the potential parallelism. 
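
For readers more familiar with conventional languages, the infinite list of squares has a rough counterpart in a Python generator. This is a sketch of the idea only; a generator produces its elements on demand but, unlike a ParAlfl list, does not retain them.

```python
from itertools import count, islice

def squares(n: int = 0):
    """Yield n*n, (n+1)*(n+1), ... without end.
    Nothing is computed until a consumer asks for the next element."""
    for k in count(n):
        yield k * k

# Only the five demanded elements are ever computed.
first_five = list(islice(squares(0), 5))
```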

Mapped expressions. A program can be 
mapped onto a particular multiprocessor 
architecture through the use of mapped 
expressions. These form one of the two 
classes of extensions (annotations) to the 
base language. (The other class is made up 
of eager expressions, which are described 
below.) Mapped expressions have the sim¬ 
ple form 

exp $on proc

which declares that exp is to be computed
on the processor identified by proc (on
proc is prefixed with $ to emphasize that
$on proc is an annotation). The expression
exp is the body of the mapped expression, 
which is to say, it represents the value to 
which the overall expression will evaluate 
(and thus can be any valid ParAlfl expres¬ 
sion, including another mapped expres¬ 
sion). The expression proc must evaluate 
to a processor ID. Without loss of general¬ 
ity, we will assume in all examples below 
that processor IDs, or pids, are integers 
and that there is some predefined mapping
from those integers to the physical
processors they denote. For example, a tree of
processors might be numbered as shown in
Figure 1(a) and a mesh as shown in Figure
1(b). The advantage of using integers is 
that the user can manipulate them with 
conventional arithmetic primitives; for ex¬ 
ample, Figure 1 also defines functions that 
map pids to neighboring pids. However, a 
safer discipline might be to define a pid as 
a unique data-type, and to provide primi¬ 
tives that enable the user to manipulate 
values having that type. 
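
Functions that map pids to neighboring pids might look like the following Python sketch. The numbering schemes are assumptions made for illustration: the tree is numbered from a root of 1 with children 2*pid and 2*pid+1, and the mesh is numbered in row-major order starting at 0. The choice to return the pid itself at a mesh edge is likewise an assumption.

```python
def tree_left(pid: int) -> int:
    return 2 * pid

def tree_right(pid: int) -> int:
    return 2 * pid + 1

def tree_parent(pid: int) -> int:
    return pid // 2

def mesh_neighbors(m: int, n: int):
    """Return (up, down, left, right) for an m x n mesh whose pids are
    0 .. m*n-1 in row-major order. At an edge, a function returns the
    pid it was given (an assumed convention)."""
    def up(pid):
        return pid - n if pid >= n else pid

    def down(pid):
        return pid + n if pid < (m - 1) * n else pid

    def left(pid):
        return pid - 1 if pid % n != 0 else pid

    def right(pid):
        return pid + 1 if pid % n != n - 1 else pid

    return up, down, left, right
```

Because pids are plain integers here, nothing prevents a program from computing a pid that names no processor, which is the motivation for treating pids as a distinct data type.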

Simple examples of mapped expres¬ 
sions. Consider the program fragment 

f(x) + g(y) 

The strict semantics of the + operator
allows the two subexpressions to be 
evaluated in parallel. If we wish to express 
precisely where the subexpressions are to 
be evaluated, we can do so by annotating 
them, as in 

(f(x) $on 0) + (g(y) $on 1)
where 0 and 1 are processor IDs. 

Of course, this static mapping is not
very interesting. It would be nice, for
example, if we were able to refer to a
processor with respect to the currently
executing one. ParAlfl provides this ability
through the reserved identifier $self,
which when evaluated returns the pid of
the currently executing processor. Using
$self we can be more creative. For
example, suppose we have a mesh or tree of
processors as shown in Figure 1; we can then
write

(f(x) $on left($self)) +
(g(y) $on right($self))

to denote the computation of the two
subexpressions in parallel on neighboring
processors, with the sum being computed
on $self.

Figure 1. Two possible network topologies:
infinite binary tree (a), and finite mesh of
size m×n (b). Listed with each topology are
functions that map pids to neighboring pids.

We can describe the behavior of $self
more precisely as follows: $self is bound
implicitly by mapped expressions; thus, in

exp $on pid

$self has the value pid in exp, unless it is
further modified by a nested mapped
expression. Although $self is a reserved
word that cannot be redefined, this
implicit binding can be best explained with
the following analogy:

exp $on pid

is like

{ $self == pid; result exp }

However, the most important aspect of
$self is that it is dynamically bound in
function calls. Thus, in

{ result (f(a) $on pid1) + (f(b) $on pid2);
  f(x) == x*x } $on pid3

a*a is computed on processor pid1, b*b
on processor pid2, and the sum on
processor pid3. As before, an analogy is
useful in describing this behavior:

f(x,y,z) == exp;
... f(a,b,c) ...

is like

f(x,y,z,$self) == exp;
... f(a,b,c,$self) ...

In other words, all functions implicitly
take an extra formal parameter, $self, and
all function calls use the current value of
$self as the value for the new actual
parameter.
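
The dynamic binding of $self can be imitated in Python with a context variable. The sketch below is a sequential simulation under stated assumptions, not a description of how ParAlfl is implemented: the helper on, the variable current_pid, and the trace list are hypothetical names, and the program is assumed to start on pid 0.

```python
from contextvars import ContextVar

current_pid = ContextVar("current_pid", default=0)  # plays the role of $self

def on(pid, thunk):
    """Model of 'exp $on pid': evaluate thunk() with $self bound to pid,
    then restore the previous binding."""
    token = current_pid.set(pid)
    try:
        return thunk()
    finally:
        current_pid.reset(token)

trace = []

def f(x):
    # f sees the caller's $self, not the $self in force where f was defined.
    trace.append(("x*x", x, current_pid.get()))
    return x * x

def program(a, b, pid1, pid2, pid3):
    def body():
        total = on(pid1, lambda: f(a)) + on(pid2, lambda: f(b))
        trace.append(("sum", total, current_pid.get()))
        return total
    return on(pid3, body)
```

Calling program(3, 4, 1, 2, 3) records that 3*3 ran with $self equal to 1, 4*4 with $self equal to 2, and the sum with $self equal to 3, matching the behavior described for pid1, pid2, and pid3.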

Although very powerful, $self is not
always needed. Particular cases
illustrating this are those in which mappings can be
made from composite objects, such as
vectors and arrays, to specific multiprocessor
configurations. For example, if f is
defined by f(i) == i**2 $on i, then the
call mka(n,f) will produce a vector of
squares, one on each of n processors, such
that the ith processor contains the ith
element (namely i²). Further, suppose we
have two vectors v and w and we wish to
create a third that is the sum of the other
two, but distributed over the n processors.
This can be done very simply by

mka(n,g);
g(i) == (v[i] + w[i]) $on i

If v and w were already distributed in the
same way, this would express the pointwise
parallel summation of two vectors on n
processors.

A note on lexical scoping and data
movement. Consider the following typical
situation: A shared value v is to be
computed for use in two independent
subexpressions, e1 and e2; the values of these
subexpressions are then to be combined
into a single result. In a conventional
language one might express this as something
like

begin v := code-for-v;
      e1 := code-for-e1; (Comment: uses v)
      e2 := code-for-e2; (Comment: uses v)
      result := combine(e1,e2);
end;

and in ParAlfl one might write 

{ v == code-for-v;
  e1 == code-for-e1; (Comment: uses v)
  e2 == code-for-e2; (Comment: uses v)
  result combine(e1,e2) }

Both of these programs are very clear and 
concise. 

But now suppose that this same
computation is to begin and end on processor
p, and the subexpressions e1 and e2 are to
be executed in parallel on processors q and
r, respectively. In a conventional language
augmented with explicit process-creation
and message-passing constructs, one
might write the following program:

process P0;
  v := code-for-v;
  send(v,P1);
  send(v,P2);
  e1 := receive(P1);
  e2 := receive(P2);
  result := combine(e1,e2)
end-process;

process P1;
  v := receive(P0);
  e1 := code-for-e1; (Comment: uses v)
  send(e1,P0);
end-process;

process P2;
  v := receive(P0);
  e2 := code-for-e2; (Comment: uses v)
  send(e2,P0);
end-process;

which is then actually run by executing 
something like 

invoke P0 on processor p;
invoke P1 on processor q;
invoke P2 on processor r;

Note that the structure of the original pro¬ 
gram has been completely destroyed. Ex¬ 
plicit processes and communications 
between them have been introduced to 
coordinate the parallel computation. The 
semantics of both the process-creation and 
communications constructs need to be 
carefully defined before the run-time 
behavior can be understood. This pro¬ 
gram is no longer as clear nor as concise as 
the original one. 

On the other hand, a ParAlfl program 
for this same task is simply 
{ v == code-for-v;
  e1 == code-for-e1 $on q; (Comment: uses v)
  e2 == code-for-e2 $on r; (Comment: uses v)
  result combine(e1,e2) } $on p
Note that if the three annotations are 
removed, the program is identical to the 
ParAlfl program given earlier! No com¬ 
munications primitives or special syn¬ 
chronization constructs are needed to send 
the value of v to processors q and r; stan¬ 
dard lexical scoping mechanisms ac¬ 
complish the data movement naturally 
and concisely. The values of e1 and e2 are
sent back to processor p in the same way. 
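
A loose analogy in a conventional setting is a closure submitted to a worker pool. The Python sketch below is illustrative only: the function run and its parameters are hypothetical, and threads in one address space stand in for the processors p, q, and r. The shared value v reaches both workers simply because it is in scope, with no explicit send or receive.

```python
from concurrent.futures import ThreadPoolExecutor

def run(code_for_v, code_for_e1, code_for_e2, combine):
    v = code_for_v()                               # computed once, "on p"
    with ThreadPoolExecutor(max_workers=2) as pool:
        e1 = pool.submit(lambda: code_for_e1(v))   # stands in for "$on q"
        e2 = pool.submit(lambda: code_for_e2(v))   # stands in for "$on r"
        # The results flow back to the caller the same way v flowed out.
        return combine(e1.result(), e2.result())
```

For example, run(lambda: 10, lambda v: v + 1, lambda v: v * 2, lambda a, b: (a, b)) returns (11, 20). On a real distributed-memory machine the run-time system, not the programmer, would have to move v between processors.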

Eager expressions. The second form of 
annotation, the eager expression, arises 
out of the occasional need for the pro¬ 
grammer to override the lazy-evaluation 
strategy of ParAlfl, since normally 
ParAlfl does not evaluate an expression 
until absolutely necessary. (This second 
type of annotation is not needed in a func¬ 
tional language with non-lazy semantics, 
such as pure Lisp, but as mentioned 
earlier, we prefer the expressiveness af¬ 
forded by lazy semantics.) An eager ex¬ 
pression has the simple form 
#exp 


which forces the evaluation of exp in par¬ 
allel with its immediately surrounding syn¬ 
tactic form, as defined below: 

If #exp appears as

• an argument to a function (for example,
f(x,#y,z)), then it executes in
parallel with the function call.

• an arm of a conditional (for example,
if p then #x else y), then it executes in
parallel with the conditional.

• an operand of an infix operator (for
example, x^#y; another example is
x and #y), then it executes in parallel
with the whole operation.

• an element of a list (for example,
[x,#y,z]), then it executes in parallel
with the construction of the list.

Thus, for example, in the expression if p
then f(#x,y) else z, the evaluation of x
begins as soon as p has been determined to
be true, and simultaneously the function f
is invoked on its two arguments. Note that
the evaluation of some subexpression
begins when any expression is evaluated,
and thus to evaluate that subexpression
“eagerly” accomplishes nothing. For
example, note the following equivalences:

if #p then x else y = if p then x else y

#x and y = x and y

#x + #y = x + y

A special case of eager computation oc¬ 
curs in the construction of arrays, which 
are almost always used in a context where 
the elements are computed in parallel. 
Because of this, the evaluations of the 
elements of an array are defined to occur 
eagerly (and in parallel, of course, if ap¬ 
propriately mapped). 

Eager expressions are commonly used
within lists. Consider, for example, the
expression [x,#y]; normally lists are
constructed lazily in ParAlfl, so the values of x
and y are not evaluated until selected. But
with the annotation shown, y would be 
evaluated as soon as the list was de¬ 
manded. As with arrays, however, the ex¬ 
pression does not wait for the value of y to 
return a fully computed value. Instead, it 
returns a partially constructed list just as it 
would with lazy evaluation. 
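
The difference between a lazy and an eager list element can be sketched in Python with a thunk and a future. The classes Lazy and Eager and the function make_list are hypothetical names introduced for this illustration, and a thread pool stands in for parallel hardware.

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

class Lazy:
    """An element that is computed only when selected."""
    def __init__(self, compute):
        self._compute = compute
        self._done = False
        self._value = None

    def force(self):
        if not self._done:
            self._value = self._compute()
            self._done = True
        return self._value

class Eager:
    """An element whose computation starts as soon as it is created."""
    def __init__(self, compute):
        self._future = _pool.submit(compute)

    def force(self):
        return self._future.result()

def make_list(x_fn, y_fn):
    """Model of [x, #y]: returns at once with a partially constructed
    list; y is already being computed, x is untouched."""
    return [Lazy(x_fn), Eager(y_fn)]
```

Forcing either element yields the same value it would have had without the annotation; only the time at which the work starts differs.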

The above discussion leads us to an
important point about eager expressions:
The value of an eager expression is that of
the expression without the annotation. As
with mapped expressions, the annotation
only adds an operational semantics, and
thus the user can invoke a nonterminating
subcomputation, yet have the overall
program terminate. Indeed, in the above
example, even should y not terminate, if only
the first element of the list is selected for
later use, the overall program may still
terminate properly. The “runaway process”
that computes y is often called an
irrelevant task, and there exist strategies for
finding and deleting such tasks at run
time. Such considerations are beyond the
scope of this article, although it should be
pointed out that given an automatic
task-collection mechanism there are real
situations in which one may wish to invoke a
nonterminating computation (an example
of this is given in Hudak and Smith 10 ).

{ result pfac(1,k) $on root;
  pfac(lo,hi) == if lo=hi then lo
                 else if lo=(hi-1) then lo*hi
                 else { result (pfac(lo,mid) $on left($self)) *
                               (pfac(mid+1,hi) $on right($self));
                        mid == (lo+hi)/2 };
  left(pe) == if 2*pe>n then pe else 2*pe;
  right(pe) == if 2*pe>n then pe else 2*pe+1;
  root == 1 }

A note on determinacy. All ParAlfl pro¬ 
grams possess the following determinacy 
property: 

A ParAlfl program in which (1) the 
identifier $self appears only in pid ex¬ 
pressions, and (2) all pid expressions 
terminate without error, is functionally 
equivalent to the same program with all 
of the annotations removed. That is, 
both programs return the same value. 
(A formal statement and proof of this 
property depends on a formal denota- 
tional semantics for ParAlfl, which is 
beyond the scope of this article, but such 
semantics can be found in Hudak and 
Smith. 10 ) 

The reason for the first constraint is that 
if the mapping annotations are removed, 
all remaining occurrences of $self have the
same value, namely the pid of the root pro¬ 
cessor. Thus, removing the annotations 
may change the value of the program. The 
purpose of the second constraint should 
be obvious: If the system diverges or errs 
when determining the processor on which 
to execute the body of a mapped expres¬ 
sion, then it will never get around to com¬ 
puting the value of that expression. 

Although neither determinacy con¬ 
straint is severe, there are practical reasons 
for wanting to violate the first one (that is, 
for wanting to use the value of $self in
other than a pid expression). The most 
typical situation where this arises is in a 
nonisotropic topology where certain pro¬ 
cessors form a boundary for the network 
(for example, the leaf processors in a tree, 
or the edge processors in a mesh). There 
are many distributed algorithms whose 
behavior at such boundaries is different 
from their behavior at internal nodes. To 
express this, one needs to know when
execution is occurring at the boundary of the
network, which can be conveniently
determined by analyzing the value of $self.

Sample application 
programs 

In this section two simple examples are 
presented that highlight the key aspects of 
para-functional programming. Space 
limitations preclude the inclusion of ex¬ 
amples that are more complex, but some 
can be found in Hudak and Smith 10 and 
Hudak. 11 

Parallel factorial. Figure 2 shows a simple
parallel factorial program annotated
for execution on a finite binary tree of
n = 2^d - 1 processors. Although computing
factorial, even in parallel, is a rather simple
task, the example demonstrates several
important ideas, and most other
divide-and-conquer algorithms could easily fit
into the same framework.

Figure 2. Divide-and-conquer factorial on finite tree.

Figure 3. Dataflow for parallel factorial.

Figure 6. Dataflow for matrix problem. (“BS” stands for backsubstitution.)

The algorithm is based on splitting the
computation into two parts at each iteration
and mapping the two subtasks onto
the “children” of the current processor.
Note that through the normal lexical scoping
rules, mid will be computed on the current
processor and passed to the child processors
as needed (recall the discussion in
the section on “Mapped expressions,”
above). The functions left and right
describe the network mapping necessary
for this topology, and Figure 3 shows the
process-mapping and flow of data between
processes when k=5.
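
The mapping described above can be sketched in Python. This is only a sequential illustration, not ParAlfl: nothing runs in parallel, and the sketch merely records which processor each call to pfac would be mapped to. The name pfac_trace, the calls log, and the use of integer division for mid are assumptions made for the illustration.

```python
def left(pe, n):
    # Left child in a heap-numbered binary tree of n processors;
    # a leaf (no child within 1..n) maps back onto itself.
    return pe if 2 * pe > n else 2 * pe


def right(pe, n):
    # Right child, with the same "stay on the leaf" rule.
    return pe if 2 * pe > n else 2 * pe + 1


def pfac_trace(lo, hi, pe, n, calls):
    # Record (processor, lo, hi) for every call, then follow the
    # same case analysis as the program in Figure 2.
    calls.append((pe, lo, hi))
    if lo == hi:
        return lo
    if lo == hi - 1:
        return lo * hi
    mid = (lo + hi) // 2  # assumption: "/" means integer division
    return (pfac_trace(lo, mid, left(pe, n), n, calls) *
            pfac_trace(mid + 1, hi, right(pe, n), n, calls))


calls = []
result = pfac_trace(1, 5, 1, 7, calls)  # k = 5 on a 7-processor tree
print(result)
for pe, lo, hi in calls:
    print(f"processor {pe}: pfac({lo},{hi})")
```

With k=5 and n=7 the log shows the root splitting the work between processors 2 and 3, and processor 2 splitting again between its children 4 and 5.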

Note that the program in Figure 2 obeys 
the constraints required for determinacy, 
and thus the program returns the same 
value regardless of the annotations. Note 
further that with the mapping used, when 
processing reaches a leaf node all further 
calls to pfac are executed on the leaf pro¬ 
cessor. Routing functions of greater com¬ 
plexity could be devised that, for example, 
would reflect the computation upward 
once a leaf processor is reached. Alter¬ 
natively, it might be desirable to use a more 
efficient factorial algorithm at the leaf 
nodes. An example of this is given in 
Figure 4, where the tail-recursive function 
sfac is invoked at the leaves. Determining 
that execution has reached a leaf processor 
requires inspection of $self, and thus the
determinacy constraints are violated, yet 
the program still returns the same value 
regardless of the annotations. This con¬ 
stancy of values is, of course, often the 
case, but it cannot be guaranteed in 
general without the previously discussed 
constraints. 

Solution to upper triangular block 
matrix. The next example is typical of 
problems encountered in scientific com¬ 
puting: The problem is to solve for the 
vector x in the matrix equation Ux=b, 
where U is an upper triangular block 
matrix (that is, a matrix whose elements 
are themselves matrices, and whose ele¬ 
ments below the main diagonal contain all 
zeros). Algorithms using block matrices 
are especially suited to multiprocessors 
with nontrivial communications costs, 
since typically the subcomputations in¬ 
volving the submatrices can be done in 
parallel with little communication be¬ 
tween the processors. 


Processor 5: Compute x_5
Processor 4: Back-substitute x_5 => Compute x_4
Processor 3: Back-substitute x_5 => Back-substitute x_4 => Compute x_3
Processor 2: Back-substitute x_5 => Back-substitute x_4 => Back-substitute x_3 => Compute x_2
Processor 1: Back-substitute x_5 => Back-substitute x_4 => Back-substitute x_3 => Back-substitute x_2 => Compute x_1

Figure 7. Pipelining data for matrix problem. (In the original figure, the key distinguishes data (and temporal) dependencies from temporal dependencies imposed by pipelining.)

{ result xvect;
  xvect == mka(n,x);
  x(i) == { result (b[i]-sum(n,0)) / U[i,i];
            sum(j,acc) == if j<i+1 then acc
                          else sum(j-1,acc+xpipe[i][n-j+1]*U[i,j])
          } $on i;
  xpipe == mka(n,xfn);
  xfn(i) == { result mka(n-i+1,xlocal);
              xlocal(j) == if j=n-i+1 then xvect[i]
                           else xpipe[i+1][j] } $on i
}

Figure 8. A program for pipelining x_i around a ring of processors.



Figure 9. A vector pipeline for x. (Diagram: the one-dimensional arrays xpipe[5] on processor 5 through xpipe[1] on processor 1.)


If we ignore parallelism at the moment 
and concentrate instead on a functional 
specification of this problem, it is easy to 
see from basic linear algebra that each
element x_i in the solution vector x (of length
n) can be given by the following equation:

$$x_i = \Bigl(b_i - \sum_{j=i+1}^{n} x_j U_{ij}\Bigr) / U_{ii}$$

where we assume for convenience that the 
submatrices are of unit size (and are thus 
represented simply as scalar quantities). 
Given this equation for each element, it is 
easy to construct the solution vector in 
ParAlfl, as shown in Figure 5. 
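
As a point of comparison, the equation above can be written as a short sequential Python sketch (this is not the ParAlfl program of Figure 5). It assumes, as the text does, unit-size submatrices, so U is an ordinary upper triangular matrix of scalars; the function name solve_upper, the 0-based indexing, and the sample data are assumptions made for the illustration.

```python
def solve_upper(U, b):
    """Solve U x = b for upper triangular U by back substitution."""
    n = len(b)
    x = [0.0] * n
    # Work from the last row upward: x[i] needs x[i+1], ..., x[n-1].
    for i in range(n - 1, -1, -1):
        acc = sum(x[j] * U[i][j] for j in range(i + 1, n))
        x[i] = (b[i] - acc) / U[i][i]
    return x


U = [[2.0, 1.0, 1.0],
     [0.0, 1.0, 2.0],
     [0.0, 0.0, 4.0]]
b = [9.0, 8.0, 12.0]
x = solve_upper(U, b)
print(x)
```

The loop makes the data dependencies explicit: every x[i] waits on all later elements, which is the dependency structure that the annotated programs in this section distribute across processors.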

This problem, as it is, has plenty of par¬ 
allelism. To see this, look at Figure 6, a 
dataflow graph showing the data depen¬ 
dencies when n=5. Clearly, once an ele¬ 
ment of the solution is computed, all of the 
backsubstitutions of it can be done in par¬ 
allel; that is, each of the horizontal 
“steps” in Figure 6 can be executed. This 
parallelism derives solely from the data 
dependencies inherent in the problem, and 
is mirrored faithfully in the ParAlfl code. 
Indeed, if we have n processors sharing a 
common memory, we can annotate the 
program in Figure 5 very simply: 

{ result xvect;
  xvect == mka(n,x);
  x(i) == {...} $on i
}

where “...” denotes the same expression
used in Figure 5 for x(i). (Recall that the
elements of an array are computed in par¬ 
allel, and thus do not require eager an¬ 
notations.) 

But let us consider topologies that are 
more interesting. Consider, for example, a 
ring of n processors. Although the 
topology of a ring is simple, its limited 
capacity for interprocessor communica¬ 
tion makes it difficult to use effectively, 
and it is thus a challenge for algorithm 
designers. We will assume that the processors
are labeled consecutively around
the ring from “1” to “n” and that the ith
row of U and ith element of b are on processor
i. We wish the solution vector x to be
distributed in the same way.

We should first note that the annotated 
program two paragraphs above would run 
perfectly well on such a topology, especial¬ 
ly with the given distribution of data. The 
only data movement, in fact, would be
that of each submatrix x_i for use on each
processor j, j<i. This data movement
would be done transparently by the
underlying operating system, and in this
case the program would probably perform
adequately.

Yet in our dual role of programmer and 
algorithm designer we may have a par¬ 
ticular routing strategy that is provably 
good and that we wish to express explicitly 
in the program. For example, one efficient
strategy is to “pipeline” the x_i around the
ring as they are generated. That is, the
element x_i is passed to processor i-1, used
there, passed to processor i-2, used there,
and so on, as shown graphically in Figure
7. There are several ways to accomplish
this effect in the program, and we shall
explore two of them.

The first requires the least change to the 
existing program, and is based on shifting 
the data by creating a partial copy of the 
solution vector on each processor, as 
shown in Figure 8. Note that the first four 
lines of this program are essentially the 
same as those given earlier. Figure 9 shows
the construction of xpipe; note the
correspondence between this diagram and the
one in Figure 7.

The second way to express the pipelin¬ 
ing of data is to interpret the algorithm 
from the outset as a network of dynamic 
processes rather than as a static set of vec¬ 
tors and arrays. In particular, we can con¬ 
jure up the following description of a pro¬ 
cess running on processor i: 

“Process i takes as input a stream of
values x_n, x_{n-1}, ..., x_{i+1}. It passes
this stream of values to process i-1
while back-substituting each value into
b_i. When the end of the stream is
reached, it computes x_i and adds this
to the end of the stream being passed to
process i-1.”

{ result process(n,[]);   (Comment: begin on processor n with empty stream)
  process(i,xstr) ==
    { xi == (b[i]-sum(xstr,n,0))/U[i,i];
      result if i=1 then addtostr(xstr,#xi)
             else process(i-1,addtostr(xstr,#xi));
      addtostr(old,x) == old^^[x];
      sum(str,j,acc) == if str=[] then acc
                        else sum(tl(str),j-1,acc+hd(str)*U[i,j])
    } $on i
}

Figure 10. Program for Ux = b simulating network of processes.

ringtocube(i) == v[i];
v == mka(n,graycode);
graycode(i) == if i<2 then i
               else { result v[2*mid-i-1]+mid;
                      mid == 2**log2(i) }

Figure 11. Program to embed ring in hypercube.

Figure 12. Embedding of ring of size 8 into 3-cube.

Assuming the same distribution of U and b
used earlier, we can represent this process
description in ParAlfl as shown in Figure
10. Note that xi is annotated for eager
evaluation, to override the lazy evaluation
of lists. Also note the correspondence
between this program and the last. The main
difference is in the choice of data structure
for x: a list is used here, resulting in a
recursive structuring of the program,
whereas a vector was used previously,
resulting in a “flat” program structure.
Choices of this kind are in fact typical of
any suitably rich programming language,
and are equally important in parallel and
sequential programming. Different data
structures can, of course, be mapped in
different ways to machines, but in this
example the annotations are essentially the
same in both programs.
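
The quoted process description can also be imitated with Python generators, one generator per process. This is a hedged sketch rather than a translation of Figure 10: the generators run on demand in a single thread, whereas the ParAlfl program evaluates xi eagerly on processor i. The names process and solve_by_pipeline, the 0-based storage of U and b, and the sample data are assumptions made for the illustration.

```python
def process(i, incoming, U, b, n):
    """Process i (1-based): forward x_n .. x_{i+1}, then emit x_i."""
    acc = 0.0
    j = n  # the first value on the stream is x_n
    for xj in incoming:
        acc += xj * U[i - 1][j - 1]  # back-substitute x_j into row i
        j -= 1
        yield xj                     # pass the value on to process i-1
    yield (b[i - 1] - acc) / U[i - 1][i - 1]  # append x_i to the stream


def solve_by_pipeline(U, b):
    n = len(b)
    stream = iter(())  # process n starts with an empty stream
    for i in range(n, 0, -1):
        stream = process(i, stream, U, b, n)
    # The stream leaving process 1 is x_n, ..., x_1; reverse it.
    return list(reversed(list(stream)))


U = [[2.0, 1.0, 1.0],
     [0.0, 1.0, 2.0],
     [0.0, 0.0, 4.0]]
b = [9.0, 8.0, 12.0]
x = solve_by_pipeline(U, b)
print(x)
```

Each generator holds only its own row of U and its own element of b, mirroring the data distribution assumed in the text.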

To carry this example one step further, 
let us now consider running any of the 
above ParAlfl programs on a multiproces¬ 
sor with a hypercube interconnection 
topology rather than a ring. One way to 
accomplish this is to simulate a ring in a 
hypercube by some suitable embedding. 
Probably the simplest such embedding is 
the reflected gray-code, captured by the 
ParAlfl functions shown in Figure 11. In
that figure, log2(i) returns the base-2
logarithm of i, rounded down to the nearest
integer (the vector v is used to “cache”
values of graycode(i)). For example,
Figure 12 shows the embedding of a ring of
size 8 into a 3-cube.

If we then replace the previous annotations
“... $on i” with “... $on
ringtocube(i),” we arrive at the desired
embedding. Note that the code for the
algorithm itself did not change at all, just the
annotations. Of course, a more efficient 
algorithm for the hypercube might exist or 
the initial data distribution might be dif¬ 
ferent, and both cases would naturally re¬ 
quire recoding of the main functions. 
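
The reflected gray-code recurrence of Figure 11 can be checked with a few lines of Python. This sketch uses plain recursion instead of the caching vector v, and it numbers ring positions from 0, which is an assumption made here so that the base case i<2 covers both 0 and 1.

```python
def graycode(i):
    # Positions 0 and 1 map to themselves.
    if i < 2:
        return i
    # mid = 2 ** floor(log2(i)), the largest power of two not above i.
    mid = 1 << (i.bit_length() - 1)
    # Reflect i about mid and set the bit corresponding to mid.
    return graycode(2 * mid - i - 1) + mid


codes = [graycode(i) for i in range(8)]
print([format(c, "03b") for c in codes])

# Consecutive ring positions (including the wrap from 7 back to 0)
# should land on hypercube nodes that differ in exactly one bit.
for i in range(8):
    diff = codes[i] ^ codes[(i + 1) % 8]
    print(i, (i + 1) % 8, bin(diff).count("1"))
```

Because neighboring ring positions map to hypercube nodes that differ in a single bit, every ring link corresponds to one direct hypercube link.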


When viewed in the broad scope
of software development methodologies,
the use of para-functional
programming suggests the following
scenario:


1. One first conceives of an algorithm 
and expresses it cleanly in a func¬ 
tional programming language. This 
high-level program is likely to be 
much closer to the problem specifica¬ 
tions than conventional language 
realizations, thus aiding reasoning 
about the program and facilitating 
the debugging process. 

2. Once the program has been written, it 
is debugged and tested on either a se¬ 
quential or parallel computer system. 
In the latter case, the compiler ex¬ 
tracts as much parallelism as it can 
from the program, but with no in¬ 
tervention or awareness on the part of 
the user. 

3. If the performance achieved in step 
two does not meet one’s needs, the 
program is refined by affixing an¬ 
notations that provide more subtle 
control over the evaluation process. 
These annotations can be added 
without affecting the program’s 
functional behavior. 

There are two aspects of this methodology 
that I think significantly facilitate program 
development: First, the functional aspects 
of a program are effectively separated 
from most of the operational aspects. Sec¬ 
ond, the multiprocessor is viewed as a 
single autonomous computer onto which a 
program is mapped, rather than as a group 
of independent processors that carry out 
complex communication and require com¬ 
plex synchronization. Together with the 
clean, high-level programming style af¬ 
forded by functional languages, these two 
aspects promise to yield a simple and 
effective programming methodology for 
multiprocessor computing systems. 

Extensions and implementation issues. 
In this article I have presented only the 
fundamental ideas behind para-functional 
programming. Work continues on several 
advanced features and alternative annota¬ 
tions that provide even more expressive 
power. These include: (1) annotations that 
reference other operational aspects of a 
processor, such as processing load; (2) 
mappings to operating system resources, 
such as disks and I/O devices; (3) in¬ 
troduction of nondeterministic primitives 
where needed; and (4) annotations to control
memory usage. The latter two features
are especially important, since they allow
one to overcome two traditional objections
to programming in the functional
style: the inability to deal with the 
nondeterminism that is prevalent, for ex¬ 
ample, in an operating system, and ineffi¬ 
ciency in handling large data structures. 
Space limitations preclude me from delv¬ 
ing into such issues, but the reader can find 
additional details in Hudak and Smith 10 
and Hudak. 11 

In addition, by concentrating in this ar¬ 
ticle on how to express parallel computa¬ 
tion, I have left unanswered many ques¬ 
tions about how one can implement a 
para-functional programming language. 
In recent years great advances have been 
made in implementing functional lan¬ 
guages for both sequential and parallel 
machines, and much of that work is appli¬ 
cable here. In particular, graph reduction 
provides a very natural way to coordinate 
the parallel evaluation of subexpressions, 
and solves problems such as how to 
migrate the values of lexically bound 
variables from one processor to another. 
At Yale a virtual parallel graph reducer 
called Alfalfa is currently being im¬ 
plemented on two commercial hypercube 
architectures: an Intel iPSC and an NCube 
hypercube. This graph-reduction engine 
will be able to support both implicit 
(dynamic) and explicit (annotated) task 
allocation. The only difficult language 
feature to support efficiently in para- 
functional programming is a mechanism 
for referencing elements in a distributed 
array; in most cases this is quite easy, but in 
certain cases it can be difficult. Although 
good progress has been made in this area, 
the work is too premature to report here. 

Related work. The work that is most 
similar in spirit to that presented in this 
article is E. Shapiro’s systolic program¬ 
ming in Concurrent Prolog 12 ; the map¬ 
ping semantics of systolic programming 
was derived from earlier work on “turtle 
programs” in Logo. Other related efforts 
include those of R. M. Keller and G. Lind- 
strom, 13 who, independent of our re¬ 
search at Yale and in the context of func¬ 
tional databases, suggest the use of an¬ 
notations similar to mapped expressions; 
and F. W. Burton’s 14 annotations to the 
lambda calculus to provide control over 
lazy, eager, and parallel execution. A more 
recent effort is that of N. S. Sridharan, 15 
who suggests a “semi-applicative” pro¬ 
gramming style to control evaluation 
order. All in all, these efforts contribute to
what I think is a powerful programming
paradigm in which operational and functional
behavior can coexist with little
adverse interaction.

Annual IEEE Design Automation
Workshop 


Sponsored by 

the IEEE Design Automation 
Technical Committee 

Theme: The Impact of Artificial Intelligence on 
Computer Aided Design 

Place: Gold Canyon Ranch, Apache Junction, Arizona 
Time: January 21-23, 1987

Purpose of the Workshop

The purpose of this year's Design Automation Workshop is to assess the role of artificial intelligence in
the design automation process. Specifically the workshop will have sessions on:

1. The Fundamentals of Logic Based Programming - essentially a tutorial on the logic based
programming techniques as employed in such languages as PROLOG and LISP.

2. Critique of Existing AI Systems - the strong and weak points of existing AI systems will be
discussed. Most of the systems discussed here will be in the digital design area but other
examples may be drawn from other fields if the experience there seems relevant.

3. New Potential Application Areas for AI - here the discussion will center on defining a list of
important problems in the CAD area which could benefit from the application of artificial
intelligence techniques.

4. New Advances in Artificial Intelligence and Related Support Fields - in this area discussion will
center on new trends in artificial intelligence and their potential impact on CAD. One area of
particular interest here is the emergence of computer architectures specifically designed to
interpret rule based programs, e.g. the Japanese 5th Generation Machines.

Workshop Location 

Gold Canyon Ranch is located in the scenic Superstition Mountains, 25 miles east of Phoenix, Arizona.
The location thus provides a relatively secluded location for the workshop yet is conveniently located 
close to a major airport. Recreation activities include horse back riding, hiking, swimming, golf, and 
tennis. The elevation of the ranch is 1715 feet. The climate is very sunny and warm and the humidity 
very low. 

Participation In the Workshop 

Attendance at the workshop is limited to 55 persons. To participate in the workshop, please submit 
a short summary of your interests and activities. If you would like to make a presentation at the 
workshop also submit a short summary of your proposed talk. Send this information to the workshop 
chairman, Jim Armstrong. If you have any suggestions for session themes or would like to organize 
a session contact Gary Leive, the program chairman. 

More Information 

If you wish more information about the workshop, contact the workshop chairman. Also, as the time 
for the workshop draws nearer, you will receive a detailed program for the workshop as well as 
information on registration at and access to the Gold Canyon Ranch.




Workshop Chairman

Dr. Jim Armstrong
Electrical Engineering Dept.
Virginia Tech
Blacksburg, VA 24061
(703) 961-4723
(703) 961-7078

Program Chairman

Gary Leive
GE-Calma
P. O. Box 13049
Research Triangle Park, NC
(919) 549-3613

high-level language
computer 
architectures and 
gives case studies of

This survey describes recent developments
in the field of high-level language
computer architectures for
VLSI. Myers 1 and Flynn 2 emphasized the
near lack of fundamental advances in 
computer architecture concepts since the 
early 1960’s. However, current research 
focuses on reevaluating the tradeoffs 
related to an optimal division of com¬ 
plexity between hardware and software. 
Computer architecture addresses a fun¬ 
damental question: Where should the 
hardware-software boundary be? 

The critical problem faced by architects 
is the conceptual gap. This conceptual gap 
results because programmers tend to think 
at the conceptual level of the HLL: the 
hardware architecture is typically based 
upon completely different concepts. Of 
the examples presented in this article, 
some reduce the conceptual gap by raising 
the level of the assembly language. Other 
examples we study eliminate the concep¬ 
tual gap entirely by removing the assembly 
language and making the HLL the machine 
language. 

A classification of HLL 
computer architectures 

In this article we classify major computer
architectures directed toward reducing
the conceptual gap between programming
language and architecture. We
obtained our classification, based on the
correspondence between the machine language
and the HLL, by merging the existing
classifications. 3,4,1 In a traditional
architecture, the HLL program undergoes 
an extensive translation process (compila¬ 
tion) to eventually execute in a relatively 
low-level machine language. At the other 
extreme, direct-execution architecture, the 
HLL is itself the machine language. In 
other words, the HLL is executed without 
intermediate translation. Other architec¬ 
tures fall between these extremes. 

Our classification divides HLL ar¬ 
chitectures into two basic classes: direct 
execution and indirect execution. The in¬ 
direct-execution architectures further 
break down into reduced and complex ar¬ 
chitectures. Complex architectures divide 
into language-directed and language- 
corresponding: Type A (translation in 
software) and Type B (translation in 
hardware). This breakdown is shown in 
Figure 1. 

Reduced architectures 

Reduced architectures represent an attempt
to reduce the conceptual gap by increasing
the performance of the architecture
when executing high-level languages.
Instruction sets for reduced architectures
are chosen after extensive study of
compiler-generated code. Only the most
frequently used instructions are selected for
hardware implementation; the others are
synthesized from groups of the basic
instructions. Thus, the instruction sets of
reduced instruction set computers (RISCs) are
oriented toward executing compiled code.

Figure 1. One possible classification of HLL computer architectures.

Some people believe that reduced archi¬ 
tectures cannot be treated as a class of 
high-level language architectures. How¬ 
ever, we treat a compiler (with code gener¬ 
ator, code reorganizer, and code op¬ 
timizer) as an integral part of a reduced 
instruction set computer. Reduced instruc¬ 
tion set computers, typically programmed 
only in HLLs, are often faster for exe¬ 
cution of HLL code than other types of 
HLL computers. Because of this and the 
numerous HLL support features of many 
RISCs, we feel justified in treating RISCs 
as HLL computers. 

Let us note here that reduced instruc¬ 
tion set computers are becoming increas¬ 
ingly popular in the research and commer¬ 
cial environments. In addition to the 
examples we present in this article, a varie¬ 
ty of new approaches are exemplified by 
the Ridge 32, 5 Pyramid 90X, 6 CMOVE, 
PIPE, 7 and others. 

IBM 801. An experimental system 
started in 1975 at the Thomas J. Watson 
Research Center, the IBM 801 was the first 
of the new generation of RISCs. The re¬ 
search team wanted to provide a better 
cost/performance ratio for executing pro¬ 
grams in high-level languages. 

One major characteristic of the IBM 
801 is that each instruction executes in a 
single machine cycle. Thus, each of the in¬ 
structions is relatively primitive when 
compared to the instructions of a complex 
architecture, which may take many cycles 
to execute. Three advantages are claimed 
for this approach: efficient interruptibility,
ease of optimization, and possible
precomputation at compile time. 8 

The first advantage, efficient interruptibility,
proves especially important for I/O
processors (and there are some indications 
that the IBM 801 is intended to be used as 
one). Since the IBM 801 has only short, 
single-cycle instructions, the design team 
reasoned that it would be acceptable to 
wait for instruction boundaries before 
checking for and servicing interrupts. 


Unlike some complex instruction set com¬ 
puter (CISC) processors, forced because 
of their multicycle instructions to service 
interrupts during the middle of instruction 
execution, the IBM 801 does not have to 
save a large internal context to handle the 
interrupt. This solution fits the RISC 
philosophy, because it reduces the com¬ 
plexity of the control logic. Moreover, it 
does not drastically affect the respon¬ 
siveness of the processor to external events 
because, on average, the interrupt is ser¬ 
viced after only half a cycle. 

Simple single-cycle instructions are 
easier to optimize because the details of 
the instructions are visible at the level of 
the compiler. For example, it may be possi¬ 
ble for the optimizer to remove some parts 
of what would be a single CISC instruc¬ 
tion, such as address computation, outside 
a loop. Single-cycle instructions also make 
it easier to precompute parts of a complex 
instruction at compile time. For example, 
if one of the operands of a multiply in¬ 
struction is a constant, it may be possible 
to replace the multiplication by a more ef¬ 
ficient sequence of shifts and adds. 
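
To make the shift-and-add idea concrete, here is a small Python sketch of how a multiplication by a known constant can be decomposed. It illustrates the general technique only and is not the IBM 801 compiler's actual method; the function names are hypothetical, and the sketch assumes a nonnegative integer constant.

```python
def shift_add_plan(constant):
    # One shift amount per set bit of the constant, lowest bit first.
    # Assumes constant >= 0.
    return [bit for bit in range(constant.bit_length())
            if (constant >> bit) & 1]


def multiply_by_constant(x, constant):
    # x * constant expressed as a sum of shifted copies of x.
    return sum(x << shift for shift in shift_add_plan(constant))


plan = shift_add_plan(10)            # 10 = 2**1 + 2**3
product = multiply_by_constant(7, 10)
print(plan, product)
```

Here x * 10 becomes (x << 1) + (x << 3), two shifts and one add, which a compiler can emit in place of a general multiply when the constant is known at compile time.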

A second major characteristic of the 
IBM 801 is that great effort has been taken 
to eliminate or reduce CPU idle time 
because of storage access and branches. 
For example, the branch instructions take
on two different forms: branch-and-execute
and regular branch.


The branch-and-execute form bypasses 
many of the problems normally associated 
with executing branch instructions in a 
pipelined processor (such as flushing the 
pipeline) by using a delayed branching 
scheme. To make the delayed branching 
scheme feasible, the compiler’s code 
generator inserts one or more NO-OP in¬ 
structions after each branch instruction. 
Then the compiler’s optimizer attempts to 
move (or duplicate) more useful instruc¬ 
tions to take the place of the NO-OPs. For 
the IBM 801 compiler, this succeeds in 60 
percent of the cases, on average. 8 At exe¬ 
cution time, the instruction following a 
branch instruction is executed whether or 
not the branch is taken. 

The compiler uses the other branch in¬ 
struction, the regular branch, when it 
can’t find suitable instructions to move 
after the branch instruction. The CPU 
locks the pipeline while the target instruc¬ 
tions are being fetched for the regular 
branch. The regular branch instruction is 
the kind used in conventional pipelined 
processors. 

Unlike some of the other RISCs, the 
IBM 801 implements many of the pipeline 
interlock functions in hardware. For ex¬ 
ample, if the destination register of in¬ 
struction i is used as the source register for 
instruction i+1, then the hardware delays
instruction i +1 until instruction i has 
completed. This contrasts with the UC
Berkeley RISC II, which uses internal
forwarding to eliminate pipeline conflicts,
and also with the Stanford MIPS, which
rearranges code before execution to
eliminate pipeline conflicts. (MIPS and RISC
will be discussed later.)

Figure 2. Overlapped register windows in RISC II.

The third, and perhaps most important, 
identifying feature of the IBM 801 is that 
the architecture attempts to ease the job of 
the system programmer. The hardware 
was developed in conjunction with an in¬ 
novative, highly successful compiler for 
the PL-8 language, and all programs for 
the system are written in this language. 
The architecture provides a regular and or¬ 
thogonal set of simple instructions, easily 
generated by the compiler 9 and, consistent
with the RISC philosophy of shifting
some of the burden to the compiler, easy to
optimize.

The IBM 801 was implemented in
small-to-medium-scale integration
emitter-coupled logic. This contrasts with all other
examples of reduced architectures pre¬ 
sented in this article, which are oriented 
toward VLSI. The IBM 801 exhibits high 
performance and does not suffer as much 
of a penalty in memory usage as one might 
expect for a RISC machine. For example,
the IBM 801 executes the inner loop of a
heap sort in 35 cycles, whereas an IBM
S370/M168 executes the same program in
96 (longer) cycles. The size of the code for
the IBM 801 averages a surprising 0.9
times the size of the same code for the IBM
S370/M168. 8


UC Berkeley RISC. A strong research 
effort at the University of California, 
Berkeley has resulted in the design, layout, 
and production of RISC I, 4 RISC II, 10 
and an instruction cache chip for RISC 
II. 11 The philosophy of the UC Berkeley 
Reduced Instruction Set Computer is to 
provide efficient support for high-level 
languages using simple hardware. RISC I 
was designed and fabricated over the 
course of 1981-82 and completed in the 
summer of 1982. RISC II was designed 
and fabricated in 1981-83 and completed a 
year after RISC I. We concentrate on 
RISC II, which offers higher performance 
through a larger register file and longer 
pipeline. 

The RISC II design resulted from a 
series of studies aimed at measuring the 
most frequently used high-level language 
features. The studies mixed programs in C 
and Pascal, none oriented toward numeric 
computation. Thus, RISC II targets 
general-purpose processing, more than 
number-crunching. 

RISC II efficiently supports (in hard¬ 
ware) two features important to high-level 
languages. First, it handles subroutine 
calls/returns with a clever register window 
scheme. The 138 registers are divided into 
overlapping windows of 32 registers each, 
as shown in Figure 2. Parameters are passed 
by placing them in the overlap area be¬ 
tween the registers of the calling routine 
and the called routine. The call instruction 
adjusts a window pointer so that the lower 
numbered registers of the calling routine 
become the higher numbered registers of 
the called routine. 
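
As a rough illustration of the overlap, the following Python sketch maps a logical register number to a physical register. The split it assumes (10 global registers, 8 windows, and a window pointer that moves by 16 registers per call) is not stated in the text; it is one arrangement consistent with 138 registers and 32-register windows. The function name and the direction in which the pointer moves are likewise assumptions.

```python
# Sketch of overlapping register windows under an assumed register split.
NUM_GLOBALS = 10   # assumed: r0-r9 are shared by every routine
WINDOW_STEP = 16   # assumed: each call shifts the window by 16 registers
NUM_WINDOWS = 8    # assumed: 10 + 8 * 16 = 138 physical registers
WINDOWED = NUM_WINDOWS * WINDOW_STEP


def physical_register(cwp: int, r: int) -> int:
    """Map logical register r (0-31) of window cwp to a physical index."""
    if not 0 <= r < 32:
        raise ValueError("logical register must be in 0..31")
    if r < NUM_GLOBALS:
        return r  # globals are the same in every window
    # Windowed registers wrap around a circular file of 128 registers.
    return NUM_GLOBALS + (cwp * WINDOW_STEP + (r - NUM_GLOBALS)) % WINDOWED


caller_cwp = 3
# Assumption: a call decrements the window pointer.
callee_cwp = (caller_cwp - 1) % NUM_WINDOWS

# The caller's low registers r10-r15 and the callee's high registers
# r26-r31 name the same physical registers, so parameters need no copying.
overlap = [
    (physical_register(caller_cwp, 10 + i), physical_register(callee_cwp, 26 + i))
    for i in range(6)
]
```

Under these assumptions each pair in `overlap` holds the same physical index twice, which is the sense in which the caller's lower numbered registers become the callee's higher numbered ones.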

The register windows avoid the over¬ 
head of placing parameters on a stack and 
saving registers between subroutine calls. 
This overhead consumes a significant por¬ 
tion of execution time for the HLL pro¬ 
grams studied. 10 

The large register file also makes it 
possible for most operands to be kept in 
registers, rather than in memory. RISC is 
a load-store architecture whose compiler 
tries to keep the most often used operands 
in registers. Consistent with the reduced 
architecture philosophy is the inclusion of 
a PC-relative addressing mode for load 
and store instructions. The PC-relative 
mode required no extra hardware since it 
was already being used for the branch in¬ 
struction. 

Figure 3. The MIPS instruction pipeline, showing (a) pipeline structure, (b) HLL
statement, (c) output of the code generator, and (d) output of the code reorganizer.

The second major feature of RISC II
that supports HLL execution is the delayed
branch. RISC circumvents the problems
normally associated with executing
branch instructions in a pipelined processor
(such as flushing the pipeline) by
using a delayed branching scheme similar
to the IBM 801 branch-and-execute. This
feature is especially important when executing
modular, HLL programs, which
have a large number of constructs that
translate into branches (call, return, if,
while, etc.).
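
The effect of a delayed branch can be seen in a toy interpreter. The sketch below is only an illustration: the instruction format, the `run` function, and the one-instruction delay slot are assumptions, and a branch placed in a delay slot is simply ignored.

```python
def run(program, delayed=True):
    """Execute a toy program and return the labels in execution order.

    Each instruction is (label, target). target is None for an ordinary
    instruction, or the index to branch to. With delayed=True the
    instruction after a branch (the delay slot) always executes.
    """
    executed = []
    pc = 0
    pending = None  # branch target waiting for its delay slot to finish
    while pc < len(program):
        label, target = program[pc]
        executed.append(label)
        next_pc = pc + 1
        if pending is not None:
            next_pc = pending  # delay slot done; the branch takes effect now
            pending = None
        elif target is not None:
            if delayed:
                pending = target  # branch after one more instruction
            else:
                next_pc = target  # conventional branch: redirect at once
        pc = next_pc
    return executed


program = [
    ("load", None),
    ("branch", 4),
    ("add", None),      # sits in the delay slot
    ("skipped", None),
    ("store", None),
]

delayed_trace = run(program, delayed=True)
plain_trace = run(program, delayed=False)
```

In the delayed case the `add` after the branch still executes, so the slot that a conventional pipeline would flush does useful work, provided the compiler can find an instruction to put there.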

The transistor count for the completed 
RISC II design is 41,000, suiting it for 
high-yield silicon VLSI fabrication. 
Because much of the chip area is taken up 
by large, regular features (such as the 
register file), the design took relatively 
little time. This short design and im¬ 
plementation time is another advantage of 
reduced architectures, since advances in 
technology can be used very quickly. 

Stanford MIPS. The Stanford Micro¬ 
processor without Interlocked Pipe 
Stages, or MIPS, is an ongoing research 
project at Stanford University. The re¬ 
search attempts to optimize the division of 
functions between software and hardware 
in a high-performance workstation. The 
research started in 1981; the first chip was 
fabricated in 1983. Current research con¬ 
centrates on the MIPS-X and MIPS 
Element. 

As in most reduced architectures, all 
MIPS instructions are kept simple so that 
they will execute in one machine cycle. The 
components of the instruction set were 
chosen after considering the frequency of 
use of each construct and the complexity 
required to implement it in hardware. 
Only instructions with high frequencies 
were chosen. Exceptions are some operat¬ 
ing system-related instructions necessary 
for minimum functionality. Gross 12 
pointed out many of the same benefits 
claimed for the IBM 801, with the added 
plus that a reduced instruction set requires 
less precious VLSI area to decode. 

The above process describes what ar¬ 
chitects have been doing, knowingly or 
unknowingly, for the past thirty-odd 
years, but in this case the conclusions they 
reached are somewhat startling. An exam¬ 
ple of an unusual decision made as a result 
of the studies is the omission of byte 
addressing. A study suggests that byte
load/stores have a low frequency of use. 13
Supporting byte addressing would require
the inclusion of a byte aligner in the critical
data path between memory and the CPU, 
and estimates indicate that the aligner 
would create a 15 to 20 percent perfor¬ 
mance penalty for all instructions, not just 
the byte-oriented ones. Therefore, the 
decision was made to support word ad¬ 
dressing only. To still permit efficient ac¬ 
cess to data organized as bytes in the 
absence of byte addressing, MIPS pro¬ 
vides byte insert and byte extract instruc¬ 
tions, which operate on word-length data 
already located in a register. 
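
A byte access on a word-addressed machine can be sketched as a word load followed by a byte extract. The Python below is a minimal illustration; the function names, the 32-bit word size, and the numbering of bytes from the least significant end are assumptions rather than details of the MIPS instructions.

```python
BYTES_PER_WORD = 4  # assumed 32-bit words


def byte_extract(word: int, index: int) -> int:
    """Return byte `index` of a word (0 = least significant, by assumption)."""
    if not 0 <= index < BYTES_PER_WORD:
        raise ValueError("byte index out of range")
    return (word >> (8 * index)) & 0xFF


def byte_insert(word: int, index: int, value: int) -> int:
    """Return `word` with byte `index` replaced by `value`."""
    if not 0 <= index < BYTES_PER_WORD:
        raise ValueError("byte index out of range")
    shift = 8 * index
    mask = 0xFF << shift
    return (word & ~mask & 0xFFFFFFFF) | ((value & 0xFF) << shift)


def load_byte(memory, byte_address):
    """Emulate a byte load: one word load, then an extract in a register."""
    word = memory[byte_address // BYTES_PER_WORD]
    return byte_extract(word, byte_address % BYTES_PER_WORD)


def store_byte(memory, byte_address, value):
    """Emulate a byte store: word load, insert in a register, word store."""
    i = byte_address // BYTES_PER_WORD
    memory[i] = byte_insert(memory[i], byte_address % BYTES_PER_WORD, value)


memory = [0x44332211, 0x88776655]
second_word_byte = load_byte(memory, 5)
```

The cost of the extra extract or insert is paid only by byte-oriented code, whereas a hardware aligner would sit in the path of every memory access.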

MIPS achieves high performance 
through heavy pipelining. It also uses a 
delayed branch similar to RISC II and the 
IBM 801. The organization of the pipeline 
is shown in Figure 3a. 

The outstanding feature of MIPS, as 
hinted by its acronym, is that the pipeline 
has no hardware to prevent resource con¬ 
flicts between stages. Instead, the instruc¬ 
tion sequence output by the compiler 
undergoes a transformation, or reorder¬ 
ing, that removes all pipeline conflicts 
prior to execution. Reordering is done 
after compilation by a separate program, 
the reorganizer. 

The reorganizer effectively splits the ar¬ 
chitecture into two levels: assembly and 
machine. At the assembly level, the 
machine appears not to be pipelined and
the programmer need not be concerned
with pipeline hazards and the conse¬ 
quences of delayed branches. At the 
machine level, the code is reorganized to 
avoid pipeline conflicts and to take advan¬ 
tage of delayed branches. A simple exam¬ 
ple for an HLL assignment statement is 
shown in Figure 3b. Straightforward code 
generation produces code that cannot exe¬ 
cute correctly because of pipeline con¬ 
straints (Figure 3c). The second LOAD in¬ 
struction uses the fifth pipeline stage to 
load the register, and the ADD instruction 
uses the same register in the third pipeline 
stage to perform an add. 12 These two 
stages overlap if the LOAD and the ADD 
instructions are executed consecutively, so 
the correct code is obtained after code 
reorganization (Figure 3d), which inserts a 
NO-OP or, if possible, more useful code 
from elsewhere in the program. 
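
The NO-OP case of this reorganization can be sketched in a few lines of Python. This is a simplified illustration, not the Stanford reorganizer: it assumes a single rule (an instruction may not read a register written by the LOAD immediately before it), uses a hypothetical statement Z := X + Y, and does not attempt to move useful code into the slot.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


NOP = Instr("NO-OP", "", ())


def reorganize(code: List[Instr]) -> List[Instr]:
    """Insert a NO-OP wherever an instruction reads the result of the
    LOAD directly ahead of it (the assumed pipeline constraint)."""
    out: List[Instr] = []
    for instr in code:
        prev = out[-1] if out else None
        if prev is not None and prev.op == "LOAD" and prev.dest in instr.srcs:
            out.append(NOP)
        out.append(instr)
    return out


# Straightforward code generation for the hypothetical Z := X + Y.
naive = [
    Instr("LOAD", "r1", ("X",)),
    Instr("LOAD", "r2", ("Y",)),
    Instr("ADD", "r3", ("r1", "r2")),
    Instr("STORE", "Z", ("r3",)),
]

fixed = reorganize(naive)
```

A fuller reorganizer would first look for an independent instruction elsewhere in the block to fill the slot and fall back to the NO-OP only when none exists.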

Effective use of the MIPS hardware re¬ 
quires a sophisticated compiler technol¬ 
ogy. In addition to handling delayed 
branches, as required by RISC II and the 
IBM 801, the MIPS compiler-reorganizer 
must ensure that executed code is free 
from pipeline hazards. (As an aside, we
should note that the code reorganization
technique could be applied to a processor
with interlocked pipe stages for a corresponding
increase in performance.)

Another interesting architectural decision
can be noted for MIPS: there are no
condition codes. Supposedly, condition
codes needlessly complicate a VLSI design
because they are an irregular structure and
because careful decoding is required to
determine which instructions affect the
condition codes and which do not. 13 Instead
of using condition codes to control
program flow, MIPS uses a “compare and
branch” instruction, which provides the
comparison essentially for free because the
instruction takes only one cycle.

A version of the MIPS architecture was 
implemented in NMOS VLSI in 1983. The 
chip has 24,000 transistors and a basic cy¬ 
cle time of 250 ns. 14 Its performance in 
executing standard HLL benchmarks such 
as the Towers of Hanoi, Quicksort, and 
matrix multiplication exceeds that of the 
Motorola 68000 by a factor of more than 
five. 

Inmos Transputer. The Transputer sys¬ 
tem is a commercial family of compatible, 
high performance microprocessors devel¬ 
oped by the English company, Inmos. 15 
Of all the examples we consider, Transputer
strays farthest from the reduced architecture
approach by including an autonomous
I/O processor, a DMA interface,
and four serial processor-to-processor
interfaces on-chip. The Transputer
design team set forth ambitious design 
goals to provide non-RISC-like features 
such as multiprocess and multiprocessor 
support and communications support for 
industry standard peripherals. 

Transputer does resemble other reduced 
architectures in that it has a tightly en¬ 
coded, simple instruction set with only 16 
opcodes. Instructions are encoded in only 
eight bits, the upper nibble being the func¬ 
tion code and the lower nibble, the func¬ 
tion. For example, the function code for 
the OPERATE instruction is 15 and the
function nibble specifies one of 16 possible
operations (ADD, SHIFT, etc.). 

Transputer instructions are indepen¬ 
dent of the word length of the processor 
and so can be used on other non-32-bit 
members of the Transputer family. For ex¬ 
ample, immediate operands are loaded 
into registers nibble by nibble using the 
“prefix” instruction, which loads a four- 
bit data item into a register and shifts it left 
four bits. The prefix instruction can be 
used to generate immediate operands of 
any size up to the word length of the pro¬ 
cessor. For example, loading a register 
with the immediate constant 17B (hexa¬ 
decimal) can be synthesized as follows: 

prefix 1H 

prefix 7H 

load constant 0BH 
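
The same sequence can be generated and checked mechanically. In the Python sketch below, the function codes chosen for prefix (2) and load constant (4) are assumptions for illustration (the text gives only the code 15 for OPERATE), and the helper names are hypothetical.

```python
PFIX = 0x2  # assumed function code for "prefix"
LDC = 0x4   # assumed function code for "load constant"


def encode(function: int, data: int) -> int:
    """Pack a 4-bit function code and a 4-bit data nibble into one byte."""
    return ((function & 0xF) << 4) | (data & 0xF)


def load_constant(value: int) -> bytes:
    """Encode a constant load as prefix instructions plus a final load."""
    if value < 0:
        raise ValueError("this sketch handles non-negative constants only")
    nibbles = []
    while True:
        nibbles.append(value & 0xF)
        value >>= 4
        if value == 0:
            break
    nibbles.reverse()  # most significant nibble first
    code = [encode(PFIX, n) for n in nibbles[:-1]]
    code.append(encode(LDC, nibbles[-1]))
    return bytes(code)


def execute(code: bytes) -> int:
    """Run the byte stream and return the last constant loaded."""
    operand = 0
    loaded = 0
    for byte in code:
        function, data = byte >> 4, byte & 0xF
        operand |= data
        if function == PFIX:
            operand <<= 4      # make room for the next nibble
        elif function == LDC:
            loaded = operand
            operand = 0        # the operand register is cleared after use
        else:
            raise ValueError("unsupported function code in this sketch")
    return loaded


program = load_constant(0x17B)
result = execute(program)
```

Because each instruction carries only four bits of data, the encoding does not depend on the word length: a longer constant simply needs more prefix instructions.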

Like the RISC II machine, Transputer 
has a large register file. It consists of 4K 
bytes of on-the-chip fast RAM, and is used 
entirely for storage of data rather than as a 
cache. Instruction prefetch is done in 
groups of four instructions (32 bits) to 
keep up with the speed of the processor. 

Only six registers are associated with 
each process: A, B, C, the program count¬ 
er, the workspace pointer, and the operand 
register, as shown in Figure 4. The instruc¬ 
tions are either zero-address or one- 
address. The OPERATE instruction is the 
only zero-address instruction; it implicitly 
uses the A, B, and C registers as an evalua¬ 
tion stack. The one-address instructions 
specify a four-bit offset to either the A 
register or the workspace pointer. 

Transputer’s support for multiprocess¬ 
ing operation is made more efficient by the 
small number of registers associated with 
the current process: six registers can be 
saved very quickly. A microcoded silicon 
scheduler provides priority-based schedul¬ 
ing among active processes, and com¬ 
munication between processes is im¬ 
plemented in the form of messages. 

One of the truly astounding capabilities 
built into the Transputer is the support for 
interprocessor communications. Instead 
of using a global bus (like RIMMS, in¬ 
troduced later), Transputer uses four 
serial point-to-point links that are com¬ 
pletely independent of the memory and 
I/O buses. The serial link operates at 
10M bits/s in all members of the Transputer
family, facilitating inter-member
compatibility. The link can be either half
or full duplex. It has a built-in protocol to
ensure the correct exchange of messages.
The transistor count and cycle time of 
T424 are 250,000 and 50 ns, respectively. 

RIMMS. The RIMMS project (for 
Reduced Instruction set architecture for 
Multi-Microprocessor Systems) is an on¬ 
going project at the University of Reading, 
England. The RIMMS philosophy differs 
from that of the IBM 801, RISC II, or 
MIPS in that RIMMS concentrates on increasing
performance by combining many
simple processors capable of functioning
as part of a multi-microprocessor system, 16
rather than by increasing the performance
of each processor.

The essential point is that low power 
MOS VLSI circuits most likely will not 
better the speed of 10-year-old TTL cir¬ 
cuits. Therefore, the route to increased 
performance is to replicate processors and 
allow them to work concurrently, rather 
than to try to increase the performance of 
the uniprocessor. 

The RIMMS architecture can be treated 
on two levels: the RIMMS level with many 
processors working as a system; and the 
RIMMS single processor level. We will 
first describe the RIMMS level shown in 
Figure 5. The RIMMS system consists of 
255 processors, each with its own memory, 
connected by a single, token-passing bus. 
Memory addresses consist of 16 bits, the 
most significant byte being the global ad¬ 
dress and the least significant byte being 
the local address. Global addresses of zero 
indicate references to local memory, while 
nonzero global addresses indicate refer¬ 
ences to memories of other processors. 
Processors may freely access data items in 
local or nonlocal memory. However, if a 
processor attempts to fetch an instruction 
from nonlocal memory, the instruction is 
executed by the nonlocal processor. This 
forms the basis for the FORK instruction. 

Communication between the processors 
takes the form of 34-bit packets consisting 
of a 2-bit operation field, a 16-bit address, 
and a 16-bit operand. The four possible 
operations are LOAD, STORE REGIS¬ 
TER, STORE MEMORY, and EXE¬ 
CUTE. Processors may either reject or 
accept packets from the global bus. For 
example, the EXECUTE packet is accepted
only when the nonlocal processor is
idle, whereas the STORE MEMORY is accepted
regardless of the processor status.

On the single processor level, each microcomputer
consists of three major components:
the processor, the memory controller, and
a memory. To facilitate process migration,
there are only two registers in the processor:
a data pointer and an instruction
pointer. The first, experimental implementation
uses custom VLSI for the processor
and memory controller (on separate chips)
and uses commercially available memory.
The transistor count for the completed
CPU chip was 17,000.
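
The global/local address split and the 34-bit bus packet described above can be sketched as follows. The order of the fields within the packet and the numeric codes assigned to the four operations are assumptions; the text gives only the field widths. Acceptance rules are coded only for the two packet types the text describes.

```python
OPERATIONS = ("LOAD", "STORE REGISTER", "STORE MEMORY", "EXECUTE")


def split_address(address: int):
    """Return the (global, local) bytes of a 16-bit address."""
    return (address >> 8) & 0xFF, address & 0xFF


def is_local(address: int) -> bool:
    """A global byte of zero refers to the processor's own memory."""
    return split_address(address)[0] == 0


def pack(operation: str, address: int, operand: int) -> int:
    """Build a 34-bit packet: 2-bit operation, 16-bit address, 16-bit operand."""
    op = OPERATIONS.index(operation)  # assumed codes 0..3
    return (op << 32) | ((address & 0xFFFF) << 16) | (operand & 0xFFFF)


def unpack(packet: int):
    """Recover (operation, address, operand) from a packet."""
    return (
        OPERATIONS[(packet >> 32) & 0x3],
        (packet >> 16) & 0xFFFF,
        packet & 0xFFFF,
    )


def accepts(operation: str, processor_idle: bool) -> bool:
    """Acceptance rules for the two packet types described in the text."""
    if operation == "EXECUTE":
        return processor_idle
    if operation == "STORE MEMORY":
        return True
    raise ValueError("acceptance rule not given in the text")


packet = pack("EXECUTE", 0x0542, 0x1234)
fields = unpack(packet)
```

Here address 0x0542 names local address 0x42 in the memory of processor 5, so an instruction fetch from it would be sent to that processor as an EXECUTE packet.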

VM Architecture. The Vertical Migra¬ 
tion, or VM, architecture is an ongoing 
research project at Purdue University. 17 
The primary goal is to provide cheap, effi¬ 
cient support for high-level language 
primitives and the most frequent non¬ 
primitives in microcode. VM targets dedi¬ 
cated high-performance applications such 
as robotics, signal processing, and process 
control. The VM architecture is tailored to 
a gallium arsenide (GaAs) implementa¬ 
tion, where the fetch time is much greater 
than the execution time. 

Extensive studies of general-purpose 
HLL programs indicate an interesting 
characteristic: most of the statements have 
a single operator and two operands. 1 
Thus, it makes little sense to increase the 
number of processing elements (PEs) in a 
general-purpose architecture, as the extra 
PEs will be used infrequently. Studies of 
signal processing HLL programs, 18 
however, indicate a much higher fre¬ 
quency of multi-operand, multi-operator 
statements. Thus, it does make sense to in¬ 
crease the number of PEs in a signal pro¬ 
cessing environment. The existence of 
multiple PEs enables a better match be¬ 
tween instruction fetch time and instruc¬ 
tion execution time in technologies such as 
GaAs. 

As implied by the name, VM has very 
high level microcode. In many cases, it is 
possible to map high-level language con¬ 
structs one-to-one onto VM’s microcode. 
This level of parallelism results from two 
features of the VM architecture: the mul¬ 
tiport register file and multiple processing 
elements in the execution unit. 

The multiport register file makes it 
possible to simultaneously access two or 
more operands and present them to the 
ALU(s). This is why it is possible to map 
HLL non-primitives into microcode. The 
level of primitivity supported depends on
the number of register file ports and can be
extended by increasing the total number of 
ports. If the complexity of the HLL state¬ 
ment is such that it cannot be executed in 
one microcode instruction, then it is split 
into two or more microinstructions. 

Figure 6 shows how the multiport 
register file and the multiple ALUs might 
be configured to execute the HLL statement
Z := D(0)*I + D(2)*J. This HLL
statement can be executed in one microcycle
with the VM HLL-oriented architecture.
Note that the multiplicity of memory
ports may exist as a logical rather than as a 
physical concept. Actually, different 
operands may be fetched sequentially in 
time from a single port to save chip area. 
This will increase the register operand 
fetch time, but will not reduce the pro¬ 
cessor speed, as the ratio of instruction 
fetch time to instruction execution time is 
high in GaAs. 


Language-directed architectures

Language-directed architectures repre¬ 
sent an extension of the traditional ar¬ 
chitectures. They reduce the conceptual 
gap by tailoring the constructs of the 
machine code to a form that can be used 
by compilers for easy synthesis of HLL 
constructs. Examples of this tailoring are 
special addressing modes for accessing 
complex data structures (Motorola) or 
object-oriented instructions (Intel). 

Figure 7. A typical iAPX 432 system, showing (a) an iAPX 432 multiprocessor system and
(b) the efficiency function.

Such architectures can provide performance
benefits and increase software
reliability. However, orienting the processor
towards one specific language can
make it less suitable for some other language.
For example, providing a loop construct
that performs the test at the end of
the loop would not improve speed for languages
(like C) that perform the test at the
beginning of the loop. Furthermore, the 
addition of the loop instruction may ac¬ 
tually slow down all other instructions 
because of the larger and slower decoder. 

In most cases, stack architectures also 
belong to the class of language-directed ar¬ 
chitectures. Well-known examples include 
the Burroughs B1700 and B5500 families, 
Hewlett-Packard HP3000, and others. 
However, we treat stack architecture as a 
concept of computer organization rather 
than as a concept of architectural support 
for HLLs, which it really is. Consequently, 
the elements of stack architecture can be 
found in all other HLL computer architec¬ 
ture classes mentioned in this article. 

In this section, we present six 32-bit, 
language-directed microprocessors: the
Intel iAPX 432, the Motorola MC68020,
the Hewlett-Packard FOCUS, the National
Semiconductor NS32032, the Zilog
Z80,000, and the Digital Equipment Corporation
VLSI VAX. Some authors would
not classify these complex instruction set
microprocessors as language-directed architectures.
However, their designers
made every effort to make them efficient 
for execution of compiled HLL code. So 
we conditionally treat them as language- 
directed machines. Among microproces¬ 
sors not treated in detail here, those that 
deserve more attention are the AT&T WE 
32 (Bell Labs) and some new Japanese 
products. 

Intel iAPX 432. The Intel iAPX 432, a 
commercially available microprocessor intended
for software-intensive applications,
attempts to make the programmer’s
job easier, thus reducing the cost of devel¬ 
oping and maintaining software. 1 The 
iAPX 432 is especially suited to applica¬ 
tions where security and fault tolerance are 
important. 

Other outstanding features of the iAPX 
432 are easy expansion and reliability. In¬ 
creased system performance is obtained 
without changing software by increasing 
the number of processors up to a total of 
five general data processors (GDP). However,
five GDPs on a single bus have a performance
equal to three independent
GDPs, and a further increase in the number
of GDPs is ineffective because of
memory contention over the single bus.
An iAPX multiprocessor system is shown 
in Figure 7a. The corresponding efficiency 
function is given in Figure 7b. 

Reliability of iAPX 432 hardware and
software is ensured by the fault handlers,
which allow programs to detect hardware 
and software errors. In addition, it is 
possible to use a redundant processor con¬ 
figuration, comparing results from two 
processors to detect errors. 

The iAPX 432 has a large address space 
and extensive floating-point facilities. 
Operands can be flexibly addressed as 
either displacements of a segment or as 
locations on a stack. Floating-point 
operands can be represented by 32, 64, or 
80 bits, and the rounding mode is under 
program control. 

Access to stored data is accomplished 
by means of capabilities, which allow 
precise protection for access, modifica¬ 
tion, and execution of segments. Associ¬ 
ated with each segment (e.g., a context 
segment) is a set of capabilities describing 
the access and creation rights for each pro¬ 
cess. Should a process have insufficient ac¬ 
cess rights for a particular segment, then a 
fault occurs if the process tries to access 
the segment. 

The hardware implementation of the 
iAPX 432 takes the form of two chips with 
a total of 160,000 transistors. The 43201 
fetches and decodes instructions, while the 
43202 performs addressing and arithmetic. 

The new Intel 80386 32-bit microproces¬ 
sor, with 270,000 transistors, is now com¬ 
mercially available in small quantities. The 
iAPX 432 is no longer in production. 

Figure 8. The HP FOCUS pipeline.

Motorola MC68020. The Motorola
MC68020 is a recent member of the well-known
MC680X0 family. It has full,
32-bit, nonmultiplexed address and data 
buses, unlike earlier members of the fami¬ 
ly. 19 The outstanding features of the 
MC68020 are the on-chip instruction 
cache, the execution pipeline, a coproces¬ 
sor interface, and several additional ad¬ 
dressing modes useful for high-level lan¬ 
guage data structures. 

The instruction cache has a direct- 
mapped organization and consists of 64 
32-bit entries for a total of 256 bytes. 
Under normal operation, the user has no 
knowledge of the presence or absence of 
the cache. However, the MC68020 pro¬ 
vides a set of privileged instructions to 
allow the supervisor program to manipu¬ 
late the cache. For example, the contents 
of the cache, normally updated on a cache 
miss, can be frozen by setting the ap¬ 
propriate bit in the cache control register. 
When the cache is frozen, cache misses do 
not result in updating the cache. 
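
The behavior of a direct-mapped cache with a freeze control can be sketched as below. The class, the choice of address bits for index and tag, and the `frozen` attribute standing in for the control register bit are all assumptions made for illustration; only the 64-entry, 32-bit organization comes from the text.

```python
class InstructionCache:
    """Toy direct-mapped cache: 64 entries of one 32-bit word each."""

    ENTRIES = 64

    def __init__(self):
        self.tags = [None] * self.ENTRIES
        self.data = [0] * self.ENTRIES
        self.frozen = False  # stands in for the freeze bit

    @staticmethod
    def _split(address: int):
        index = (address >> 2) & 0x3F  # assumed: bits 2-7 select the entry
        tag = address >> 8             # assumed: remaining high bits are the tag
        return index, tag

    def fetch(self, address: int, memory: dict):
        """Return (word, hit). `memory` maps word-aligned addresses to words."""
        index, tag = self._split(address)
        if self.tags[index] == tag:
            return self.data[index], True
        word = memory[address & ~0x3]
        if not self.frozen:  # a frozen cache is not updated on a miss
            self.tags[index] = tag
            self.data[index] = word
        return word, False


# Two addresses that map to the same entry under the assumed bit layout.
memory = {0x100: 0xAAAA, 0x200: 0xBBBB}
cache = InstructionCache()
hits = [cache.fetch(0x100, memory)[1], cache.fetch(0x100, memory)[1]]
cache.frozen = True
hits.append(cache.fetch(0x200, memory)[1])  # miss; entry 0 is left alone
hits.append(cache.fetch(0x100, memory)[1])  # still present
```

Freezing lets a supervisor keep a critical loop resident: later misses are served from memory without displacing what is already cached.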

The CPU has a three-stage pipeline to 
improve execution speed. The pipe stages 
allow overlapped prefetch, immediate 
operand extraction, and decode of either 
three words of a single instruction or up to 
three successive instructions. The pipeline 
causes the execution time of some instructions
to completely overlap the execution
of others, so that the effective execution
time for those overlapped instructions is
zero.

Further performance improvement 
comes from the ability to interface to spe¬ 
cial-purpose coprocessors. For example, 
floating-point instructions can be executed 
by the special floating-point MC68881 
chip, and memory management capabilities 
are incorporated into the MC68451 chip. 
The MC68020 incorporates a hardware 
communications protocol for synchronous 
coprocessor operation such that the main 
processor is idle while the coprocessor is 
running. Thus, the designers have ensured 
future upgradability of the current product. 
The MC68020 is implemented with 150,000 
transistors. 

HP FOCUS. The Hewlett-Packard 
FOCUS is a 32-bit CPU chip intended 
predominantly for use in Hewlett-Packard 
company products. The goal of the 
FOCUS design team was to improve 
design productivity, testability, and system 
performance. 20 The chip has been used in¬ 
ternally at HP since 1981. 


The most notable feature of the FOCUS 
CPU chip is the high level of performance 
it achieves. The design team determined 
that the principal potential bottlenecks in 
the machine’s performance were the 32-bit 
add, qualifier testing, bit extraction, and 
address computation. To optimize perfor¬ 
mance, special-purpose hardware was de¬ 
veloped for these critical functions. For 
example, the ALU has a full, 32-bit carry 
lookahead adder, and the shifter is capable 
of shifting up to 31 places in a single cycle. 
Another notable feature is the inclusion of 
28 identical 32-bit registers. 

To further enhance performance, the 
machine is heavily pipelined, as shown in 
Figure 8. The instruction stream is fed 
through a three-stage pipeline in which 
fetch, decode, and execute are over¬ 
lapped. In addition to the usual in¬ 
struction-level pipelining, the micro¬ 
architecture is also pipelined. 

Another notable feature provided by 
the CPU chip: hardware segmentation 
support. The address space consists of 
four segments: code, stack, global data, 
and external data. Each of the segments is 
accessed through 32-bit on-chip base ad¬ 
dress registers. 

Finally, the HP FOCUS design team ex¬ 
pended much effort in making the final 
implementation testable. There are 600 
metal probe pads on the chip’s surface for 
testing in the manufacturing phase. In ad¬ 
dition, special debug hardware is available 
for microcode-level, instruction-level, and 
HLL-level debugging. This hardware 
allows breakpoints, register modification, 
and single-stepping at either the micro¬ 
code or instruction level. Extra microcode 
allows high-level language breakpoints, 
single-stepping, and tracing. 20 

The resulting chip has 450,000 transis¬ 
tors and operates at a speed of 18 MHz. 
The performance of the chip is comparable
to that of a mainframe computer, and the 32-bit
integer multiply time of 1.8 microseconds 
is certainly impressive. The chip utilizes a 
minimum feature size of one micron. 

National Semiconductor NS32032. The
National Semiconductor NS32032 is a
32-bit CPU targeted towards high-perfor¬ 
mance applications, such as engineering 
and CAD workstations. The overriding 
philosophy of the NS32032 is to reduce 
bus interference between direct memory 
access, multiple CPUs, and graphics by 
reducing memory bus traffic. 21 

The NS32032 employs two ways of 
reducing memory bus traffic: it maximizes 
the information in each transfer and 
eliminates transfers altogether by keeping 
information where it is needed. 

The information per transfer is max¬ 
imized by using a wide, 32-bit bus and by 
compactly encoding the instruction set. 
The efficiency of the instruction set in 
terms of memory bandwidth is enhanced 
by variable-size displacements, special ad¬ 
dressing modes, and the lack of instruc¬ 
tion alignment restrictions. For example, 
the register-relative displacement address¬ 
ing mode allows the displacement to be en¬ 
coded in from one to four bytes. The short 
displacement mode, which is used most 
often, requires only one byte. Thus, 
memory traffic is reduced for the most 
commonly used displacement mode. 
Memory traffic is also reduced by the in¬ 
clusion of an eight-byte memory prefetch 
buffer. This buffer reduces the criticality 
of memory references by the CPU. 

Figure 9. The NS32032 data path.

The reduction in memory traffic is
achieved at the expense of increased on-chip
complexity. For example, decoding
the variable length instructions requires
extra decoding circuitry. However, the
designers view this tradeoff as acceptable
since they point to the memory bus as the
principal performance-limiting bottleneck. 21

The NS32032 supports keeping infor¬ 
mation where needed by providing eight 
general-purpose and eight floating-point 
registers (Figure 9). A sophisticated com¬ 
piler is required to perform flow analysis 
to determine what data to keep in the 
registers. If the most often used data is 
kept in the registers, the overall number of 
memory references is reduced. 

Lastly, the memory traffic for a 
NS32032 system is reduced by the memory 
management unit, which maintains an on- 
chip cache of the most recent virtual-to- 
physical address translations. The cache 
has a typical hit ratio of 0.98, so that only 
about two percent of memory accesses re¬ 
quire extra fetches from the memory. 

Zilog Z80,000. The Zilog Z80,000 is a
high-performance 32-bit CPU developed
for applications such as graphics, array
processing, and desktop computing. The
Z80,000 attempts to provide flexible and
simple support for overall systems
design. 22

The Z80,000 incorporates many features
common to the new generation of
32-bit CPU chips: virtual memory management,
cache, pipeline, and multiprocessing
support. Its register file uses a
clever scheme to increase its effective size 
for short operands. Any of the 16 32-bit 
registers can be used in ALU operations 
without affecting the high-order bytes. 
Thus, one can store several short operands 
in each register. 

The memory management provided on- 
chip is quite complete. The address space 
can use one of three models: segmented, 
linear, or compact. The CPU also per¬ 
forms virtual-to-physical address trans¬ 
lations on-chip. A buffer of the most 
recently used translations, the TLB, pro¬ 
vides a hit ratio of 0.96. It improves per¬ 
formance by reducing the average number 
of physical memory accesses needed for 
translation. 

In addition to the TLB, the Z80,000 also 
has a flexible, on-chip, 256-byte cache. An 
unusual feature of the cache is that it may 
be tailored to the application by changing 
its mode. The modes are instruction cache 
only, data cache only, both instruction and 
data cache, and local memory. Most other 
microprocessors provide only the simpler 
instruction caching capability (like the 
MC68020). 

The CPU has a six-stage pipeline similar 
in breakdown to the MIPS pipeline (see 
Figure 10). Unlike MIPS, however, the 
Z80,000 pipeline uses a hardware register
interlock mechanism to eliminate hazards.
The Z80,000 pipeline also does not use a
delayed branching scheme to reduce the
average delay caused by pipeline flushing.
Thus, both conditional and uncondition¬ 
al branches cause a rather severe perfor¬ 
mance penalty. 

According to Patel, 22 the 25-MHz 
CPU achieves a peak performance of 12.5 
MIPS executing simple instructions such 
as register-to-register moves and memory- 
to-register adds. The average performance 
reaches 3.7 MIPS, taking into account the 
memory and pipeline delays. 
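
These figures imply an average number of clock cycles per instruction, which the short calculation below makes explicit. It uses only the clock rate and MIPS ratings quoted above; the function name is hypothetical.

```python
CLOCK_MHZ = 25.0
PEAK_MIPS = 12.5
AVERAGE_MIPS = 3.7


def cycles_per_instruction(clock_mhz: float, mips: float) -> float:
    """Clock cycles spent per instruction at a given MIPS rating."""
    return clock_mhz / mips


peak_cpi = cycles_per_instruction(CLOCK_MHZ, PEAK_MIPS)
average_cpi = cycles_per_instruction(CLOCK_MHZ, AVERAGE_MIPS)
```

The peak rating corresponds to two cycles per instruction, while the average rating corresponds to between six and seven, the difference being the memory and pipeline delays.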

DEC VLSI VAX. The Digital Equip¬ 
ment Corp. VLSI VAX is a high-perfor¬ 
mance VLSI implementation of the VAX 
family that achieves the performance of 
a VAX 11/780. The goals of the design 
team were to obtain fast address trans¬ 
lation and fast parsing of the variable- 
length instructions. 23 

The VLSI implementation of the VAX 
consists of three chips: an instruction fetch
and execute chip (IE), a memory mapping
chip (M), and an optional floating-point 
chip not discussed here. The IE chip con¬ 
tains the instruction prefetch, decode and 
execution hardware, while the M chip con¬ 
tains the cache, backup translation buffer 
(BTB), and other miscellaneous hardware 
(see Figure 11). 

The memory management support pro¬ 
vided by the M chip exceeds even that of 
the Zilog Z80,000. Address translation 
proceeds in several steps, the first of which 
is the mini-translation buffer (MTB) 
check. The MTB is a small, five-entry 
table of the most recently used trans¬ 
lations. If the MTB doesn’t contain the 
translation, the MTB is updated from the 
BTB. The BTB is a larger 512-entry table 
of recent translations. If the BTB doesn’t 
contain the translation, a full address 
translation sequence is executed. 

Translation in the MTB takes only one 
cycle; translation in the BTB takes two 
cycles; translation in memory requires 20 
cycles. The first two cases occur most frequently.
Address calculation uses the same
ALU as the instruction stream, unlike the
Z80,000, which has a separate ALU for
address calculations. 22
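
The average cost of a translation follows from these cycle counts once hit rates are known. The sketch below treats the quoted counts as total costs for each case; the hit rates passed in are hypothetical, since the text says only that the first two cases occur most frequently.

```python
MTB_CYCLES = 1      # translation found in the mini-translation buffer
BTB_CYCLES = 2      # translation found in the backup translation buffer
MEMORY_CYCLES = 20  # full translation sequence from memory


def average_translation_cycles(mtb_hit_rate: float, btb_hit_rate: float) -> float:
    """Expected cycles per translation.

    mtb_hit_rate: fraction of translations found in the MTB.
    btb_hit_rate: fraction of MTB misses that are found in the BTB.
    """
    for rate in (mtb_hit_rate, btb_hit_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("hit rates must lie between 0 and 1")
    mtb_miss = 1.0 - mtb_hit_rate
    return (
        mtb_hit_rate * MTB_CYCLES
        + mtb_miss * btb_hit_rate * BTB_CYCLES
        + mtb_miss * (1.0 - btb_hit_rate) * MEMORY_CYCLES
    )


# Hypothetical hit rates, chosen only to show the shape of the calculation.
estimate = average_translation_cycles(0.9, 0.9)
```

With these illustrative rates the average stays close to the one-cycle best case, which is the purpose of the two-level buffer arrangement.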

The other major objective, efficient in¬ 
struction parsing, is handled by the IE 
chip. Instruction parsing is necessary in 
the VAX architecture because instructions 
may vary in length from 1 to over 100 bytes 
each. 23 The IE chip sequentially decodes 
the fields of the instruction and places 
their values into specific registers. The 
speed of the parsing task is increased by 
the use of parallelism in the microcode. 
During each microcycle, up to three op¬ 
erations can be performed in parallel: a 
main operation, a length/condition code 
operation and a miscellaneous operation. 
Thus, the IE chip can simultaneously load 
microregisters, increment the program 
counter, and prefetch instructions. 


Figure 10. The Z80000’s six-stage pipeline.

Figure 11. Block diagram of the VLSI VAX.


Language-
corresponding
architectures-Type A

The Type A architecture represents an
attempt to reduce the conceptual gap by
raising the level of the machine language
closer to the high-level language. Type A
and Type B architectures are similar in the
sense that in both cases, the machine languages
have a one-to-one correspondence
with the source language. However, the ar¬ 
chitectures differ in the method by which 
the source language is translated into 
machine language. This task is accom¬ 
plished by software in Type A and by 
hardware in Type B architectures. 1 Po¬ 
tential advantages of the Type A architec¬ 
ture include high execution speed, in¬ 
creased programmer productivity (since 
debugging is done at the HLL level), and 
better software reliability. 

We consider the Type A architectures of 
Scheme-79 and IBM APL. However, other 
work in the USA 1 and Japan 24 also 
deserves mention. Note that the research 
on both example machines started in the 
1970’s. However, the ideas from these ex¬ 
ample machines are still being used in 
various ongoing research projects. 

Scheme-79/81. The Scheme-79/81 re¬ 
search project at the MIT Artificial In¬ 
telligence Laboratory began in 1979. 25 
The goal of the research team was to 
design and fabricate a chip to execute a 
dialect of Lisp called Scheme-79/81. Ac¬ 
cording to the authors, Lisp was chosen as 
the target language because it is a simple 
language whose complex constructs are 
synthesized from simple operators and 
data types. 

Lisp programs are converted into ma¬ 
chine programs by a software translator. 
The machine program is executed by an 
interpreter residing in microcode. The 
machine program consists of a sequence of 
list nodes, each of which contains a 24-bit 
data field, a type field, and a bit used for 
storage allocation. The data field can 
consist of a literal, local pointer, global 
pointer, etc., as determined by the type 
field. The type field of the list node can be
considered the opcode of the Scheme-79/81
chip.
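
The node layout just described can be sketched as a packed word. In this sketch only the 24-bit data field comes from the text; the 7-bit type field width, the bit ordering, and the two type codes are assumptions chosen for illustration.

```python
DATA_BITS = 24   # from the text
TYPE_BITS = 7    # assumption: the text does not give the type field width
# Assumed layout, high to low: [storage-allocation bit | type field | data field]

DATA_MASK = (1 << DATA_BITS) - 1
TYPE_MASK = (1 << TYPE_BITS) - 1


def pack_node(type_code, data, mark=0):
    """Pack one list node into an integer word."""
    if not 0 <= data <= DATA_MASK:
        raise ValueError("data does not fit in 24 bits")
    if not 0 <= type_code <= TYPE_MASK:
        raise ValueError("type code does not fit in the type field")
    return ((mark & 1) << (DATA_BITS + TYPE_BITS)) | (type_code << DATA_BITS) | data


def unpack_node(word):
    """Return (type_code, data, mark) for a packed node."""
    data = word & DATA_MASK
    type_code = (word >> DATA_BITS) & TYPE_MASK
    mark = (word >> (DATA_BITS + TYPE_BITS)) & 1
    return type_code, data, mark


# Hypothetical type codes: the type field selects the action, like an opcode.
HANDLERS = {
    0: lambda data: ("literal", data),
    1: lambda data: ("local-pointer", data),
}


def execute(word):
    type_code, data, _ = unpack_node(word)
    return HANDLERS[type_code](data)
```

Dispatching on the type field through a table is what makes it behave like an opcode: the interpreter never inspects the data field until the type has selected a handler.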

The principal performance limitation of 
the Scheme-79/81 chip relates to the gar¬ 
bage collection of unused list nodes. The 
garbage collector traverses the node tree 
and marks all nodes that cannot be 
reached by the current program for collec¬ 
tion. The garbage collection and execution 
of the program cannot be done concur¬ 
rently since they share registers. 
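
As a rough illustration of the traversal, the following sketch finds the nodes that cannot be reached from a set of roots. It is a generic reachability pass, not the chip's microcoded collector; the graph representation (a dictionary from node id to child ids) and the function name are assumptions.

```python
def collectable_nodes(nodes, roots):
    """Return the ids of nodes that no root can reach.

    nodes: dict mapping node id -> list of child node ids
    roots: iterable of node ids the running program still holds
    """
    reached = set()
    pending = list(roots)
    while pending:
        node = pending.pop()
        if node in reached:
            continue                 # already visited; also guards against cycles
        reached.add(node)
        pending.extend(nodes[node])
    return set(nodes) - reached
```

Because the traversal needs its own working state, a design in which the collector and the interpreter share registers has to stop one to run the other, which is the limitation noted above.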

Execution time for the Scheme-79/81 
chip is also limited by the implementation 
of the register file (single bus, unbuffered 
register cells) and the technology (slow 
decoder programmable logic arrays). 


However, the Scheme-79/81 chip shows 
performance from one to three times 
faster than that of a PDP-10 in the Lisp 
environment. 25 

IBM APL Machine. The IBM APL 
machine was developed at the IBM Scien¬ 
tific Center and completed in 1973. The 
goal of the research team was to design 
and implement a complete system for the 
execution of APL programs. The essence 
of this approach is implementation of 
HLL support through extensive micropro¬ 
gramming. 

The APL language was chosen because 
of its usefulness in numeric and data processing
and because of its power and flexibility.
However, this choice contributed to
some of the difficulties in the implementa¬ 
tion. Unlike compiler-oriented languages 
such as Fortran, APL is a highly dynamic 
language. A single statement may be exe¬ 
cuted in a completely different fashion, 
depending on the values of its arguments. 
For example, the statement “A gets 
B + C” may give either a scalar or a vector 
result, depending on the contents of 
variables B and C. Thus, APL is not well 
suited to efficient compilation, and much 
of the program processing must occur at 
execution time. 
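
The run-time dependence can be illustrated with a small Python function. This is a sketch of the idea only: Python lists stand in for APL vectors, and the function name is hypothetical.

```python
def apl_add(b, c):
    """Add two values whose shapes are known only at run time."""
    b_is_vector = isinstance(b, list)
    c_is_vector = isinstance(c, list)
    if not b_is_vector and not c_is_vector:
        return b + c                                  # scalar result
    if b_is_vector and c_is_vector:
        if len(b) != len(c):
            raise ValueError("length error")
        return [x + y for x, y in zip(b, c)]          # element-wise vector result
    if b_is_vector:
        return [x + c for x in b]                     # scalar extended over a vector
    return [b + y for y in c]
```

The same source statement takes one of three paths depending on its operands, so a translator that runs before the values exist cannot commit to a single code sequence.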

The internal representation of the APL 
program is very similar to its external 
representation. A supervisor/translator 
program accepts the source program as in¬ 
put, and the translator section changes the 
program into reverse Polish for storage in 
the program memory. At execution time, a 
microcoded statement scanner and syntax 
analyzer process the code line-by-line. To 
achieve optimum efficiency, separate 
microcode routines are used for scalar and 
vector operations. Some of the more com¬ 
plicated APL operators are actually im¬ 
plemented by an APL-coded routine be¬ 
cause of control storage limitations. 

The hardware used to implement the 
APL machine was the IBM S360/M25, 
chosen for its writable control store and 
low cost. The total control memory avail¬ 
able is 16K bytes, and the 4K-byte aux¬ 
iliary memory holds the IBM 360 registers. 
In normal operation, the IBM 360 is first 
loaded with the IBM 360 emulator. Then 
the APL system routine is loaded and the 
system switches to the APL mode. 

The completed IBM APL machine 
shows a performance advantage only on
vector operations with long operands. The
loss of speed is attributed to the need to 
repeatedly scan and analyze the HLL 
statements. The research team suggested 
that specialized scanner/analysis hard¬ 
ware and vector processors would increase 
the speed of the machine. 

Language- 
corresponding 
architectures-Type B 

The Type B architecture, like the Type 
A architecture, represents an attempt to 
reduce the conceptual gap by raising the 
level of the machine language to a level 
closer to the high-level language. Unlike 
the Type A architecture, however, the 
Type B architecture translates the source 
language into the machine language using 
a hardware translator. Because the cor¬ 
respondence between the high-level lan¬ 
guage and the machine language is one- 
to-one, the high-level language can be 
considered the assembly language of the 
Type B machine. 

A potential advantage of the Type B 
architecture is its speed of assembly, since 
the assembler is implemented in hardware 
or microcode. However, the speed advan¬ 
tage is achieved at the expense of increased 
design and development cost and also at 
the expense of flexibility. The assembler 
(hardwired or microcoded) is difficult and 
time-consuming to design and inconve¬ 
nient to modify. Further research will tell 
whether the Type B architecture represents 
an optimal tradeoff between hardware 
and software. The answer, obviously, 
depends on the application and the impor¬ 
tance of fast assembly. 

We consider two examples of Type B 
architecture in some detail: Symbol and 
DELtran. However, some other work, 
such as that by Bose 26 and Halabi, 27 also 
deserves attention. Although the related 
research started in the 1960’s and 1970’s, the
ideas of Symbol and DELtran still appear 
in a number of current research projects at 
Intel, IBM, Aerospace Corporation, Stan¬ 
ford, Purdue, Illinois, etc. 

Symbol. The Symbol system was devel¬ 
oped by Fairchild beginning in the 
mid-1960’s. Fairchild built only one Sym¬ 
bol machine, later studied at Iowa State 
University. 1 The goal of the design team
was to increase performance while incorporating
new technologies and reducing
costs. The Symbol system was oriented
towards general data processing applications
rather than number-crunching.

The Symbol system supported only one 
programming language: Symbol Pro¬ 
gramming Language, or SPL. SPL is a 
block-oriented language with some fea¬ 
tures of APL, PL/I, and Lisp. It has only 
two data types: scalars and structures. 
Arithmetic is all done in base 10. In
SPL, the programmer has to synthesize
iterative loops from the conditional IF 
statement and the GOTO statement. 

A translator processor translates the 
SPL source code into the machine lan¬ 
guage based on a one-to-one correspon¬ 
dence. Since the machine language is a 
reverse Polish representation of the SPL, 
the translation process is relatively easy to 
do in hardware. An interesting detail of 
the translator processor is how it handles 
identifiers. For each block in the source 
program, the translator processor estab¬ 
lishes a name table. The name table, simi¬ 
lar to an assembler’s symbol table, con¬ 
tains the string representing the identifier 
and the identifier control word, which is a 
form of descriptor. Unlike the DELtran 
system discussed in the next section, Sym¬ 
bol stores the complete identifier rather 
than an encoded, compact representation 
of the identifier. 
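
To suggest why a reverse Polish target keeps the translator simple, here is a minimal operator-precedence conversion in Python. It is not SPL's grammar or the Symbol hardware algorithm: the operator set, the precedence table, and the pre-split token list are all assumptions.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}   # assumed operator set


def to_reverse_polish(tokens):
    """Convert a list of infix tokens to reverse Polish order."""
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Emit operators that bind at least as tightly before stacking this one.
            while (stack and stack[-1] in PRECEDENCE
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()                          # discard the "("
        else:
            output.append(tok)                   # identifier or literal
    while stack:
        output.append(stack.pop())
    return output
```

The conversion needs only one pass and one stack, and each source token appears exactly once in the output, which is consistent with the one-to-one correspondence described above.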

Another notable feature of the Symbol 
system is the hardware memory manager. 
The memory controller executes many of 
the same functions as the (software) 
memory manager of an operating system 
(such as allocate memory, return memory 
to free list, etc.). In addition, the memory 
controller maintains a set of lists that allow 
translation of logical addresses to physical 
addresses. Garbage collection of unused 
pages is accomplished by a separate pro¬ 
cessor at low priority, which avoids tying 
up the system while memory is reclaimed. 

Hindsight allows us to identify some 
major deficiencies of the Symbol system. 
The outstanding problem: Symbol is 
oriented towards only one programming 
language. This reduces flexibility if pro¬ 
grammers wish to use other, more widely 
known languages such as Fortran or 
Cobol. Myers suggests that this problem 
might be overcome by developing special 
central processors and translators for 
other languages. 1 In addition, Symbol 
users found that implementing all of the 
functions in hardware was ineffective. For
example, the interface processor of the 
original system provided a text editor that 
required special hardware (terminals). 
However, at Iowa State the function of the 
interface processor was largely replaced by 
a flexible, easier-to-use software text 
editor. 

DELtran. DELtran is an experimental 
system developed at the Stanford Emula¬ 
tion Laboratory in 1977. The DELtran 
design team strove to reduce the concep¬ 
tual gap between the machine language 
and the source language and also the con¬ 
ceptual gap between the machine language 
and the machine’s execution interpreter. 

DELtran was designed to support a 
minimal subset of Fortran-II. Fortran-II 
represents a simpler target language than 
Pascal or PL/I because of the limited 
scope of its identifiers and because of its 
strongly typed nature. 

The architecture uses a universal host 
machine, in this case called Emmy, which 
is not biased towards any specific machine 
language. The Emmy machine is equipped 
with 4000 words of writable control store. 

The outstanding feature of DELtran is 
the mechanism used for referencing iden¬ 
tifiers. It is important to recognize a major 
difference between DELtran and tradi¬ 
tional architectures: DELtran instructions 
contain identifiers (variable names and 
labels) closely related to the source lan¬ 
guage rather than to the execution ar¬ 
chitecture. Thus, the machine instruction 
does not refer to machine registers or 
memory addresses. Instead, the name 
space of the instructions corresponds to 
the name space of the source language. In 
Fortran-II, the name space is unique for 
each environment (subroutine or func¬ 
tion) since routines at a lower lexical level 
may not access variables in higher lexical 
levels. 

DELtran uses an interesting scheme to 
minimize memory usage and execution 
time due to identifier referencing. Instead 
of referencing the identifier itself, the 
machine instruction uses a compact, en¬ 
coded representation of the identifier. 
For example, the first variable in a (sub)program
is assigned the number 1 by the compiler,
the second one is assigned the number 2,
and so on. Thus, only $\lceil \log_2 N \rceil$ bits
are needed to represent N identifiers in an
environment.
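
The numbering scheme and the resulting field width can be sketched as follows. The function names are hypothetical, and the code illustrates the arithmetic only, not the DELtran compiler.

```python
def number_identifiers(names):
    """Assign 1, 2, 3, ... to identifiers in order of first appearance."""
    table = {}
    for name in names:
        table.setdefault(name, len(table) + 1)
    return table


def identifier_bits(n):
    """Bits needed to give each of n identifiers a distinct code."""
    if n < 1:
        raise ValueError("need at least one identifier")
    return (n - 1).bit_length()   # integer form of ceil(log2(n))
```

For instance, an environment with nine identifiers needs a 4-bit field, where a scheme that stored the names themselves would need several bytes per reference. A field of that width holds the codes 0 to N-1, so an encoding would presumably store the assigned number minus one; the text does not say.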

How does this mechanism work during 
execution? As mentioned before, each
Fortran-II environment has a unique
name space. During execution, an en¬ 
vironment pointer (EP) points to an access 
table of descriptors. Each descriptor con¬ 
tains the type and address of the variable. 
The sum of the EP and the variable num¬ 
ber (from the instruction) forms the ad¬ 
dress of a descriptor, thus generating the 
address of the variable with one level of in¬ 
direction. Separate access tables are used 
for each subprogram, and the EP is 
changed upon CALLing and RETURNing.
A distinct increase in execution speed
is obtained in DELtran by maintaining the 
access tables in a high-speed writable con¬ 
trol store. 
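
The addressing path can be modeled in a few lines of Python. Everything named here is hypothetical: the class and method names, the use of a list for the control store, and the stack of saved environment pointers are assumptions made so that the sketch runs.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    type_name: str   # type of the variable
    address: int     # where the variable lives in data memory


class DeltranModel:
    """Toy model: descriptor index = environment pointer + variable number."""

    def __init__(self, control_store, data_memory):
        self.control_store = control_store   # holds the access tables
        self.data_memory = data_memory
        self.ep = 0                          # environment pointer
        self._saved = []                     # assumed mechanism for restoring EP

    def call(self, new_ep):
        self._saved.append(self.ep)          # EP changes on CALL
        self.ep = new_ep

    def ret(self):
        self.ep = self._saved.pop()          # and is restored on RETURN

    def load(self, var_number):
        descriptor = self.control_store[self.ep + var_number]   # one indirection
        return self.data_memory[descriptor.address]
```

The same variable number resolves to different storage in different subprograms because only the environment pointer changes, which is why the instruction itself never needs a machine address.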

The interpreter uses a total of only 800 
words of control store; the remaining 3200 
words are dedicated to access tables. The 
space used by the interpreter compares 
favorably to that used by more traditional 
processors (for example, 1200 words for 
the PDP-11 and 2100 words for the 
System/360 2 ). Execution times of the 
Whetstone benchmark favored DELtran 
over traditional architectures by an 
average of five to one. 

Direct-execution 

architectures 

Direct-execution architecture represents 
one extreme in the area of hardware/soft¬ 
ware tradeoffs. Whereas all other classes 
of architectures perform some software 
preprocessing or translation of programs 
before executing them, direct-execution 
architecture executes the high-level language
source code directly. The direct-execution
architecture has some potential advantages,
among them: no compilation,
one-copy program storage (no object
files), and a high level of interactiveness. 28

Figure 12. Block diagram of a direct-execution machine.

A typical direct-execution architecture 
differs from von Neumann architectures 
in three ways. 28 First, code and data are 
kept in separate memories. Second, the 
central processor is split into separate pro¬ 
cessors for control and data. Third, the 
control and data processors execute con¬ 
currently. 

The code is stored in a program mem¬ 
ory, while the data is stored in a data 
memory. The lexical processor forms 
tokens from the characters of the source 
program. The tokens are kept in the token 
register during execution of the corre¬ 
sponding HLL statement. The token 
register corresponds to the instruction 
register in a conventional architecture. 
The lexical processor maintains a PM_LOCN
register, similar to a program
counter, that points to the program loca¬ 
tion of the token contained in the token 
register. The lexical processor operates 
concurrently with the control processor 
and the data processor. 

The data processor executes tokens that 
perform data manipulation, such as addi¬ 
tion and multiplication. 28 In addition, the 
data processor manages access to the data 
memory by means of symbol tables and 
storage management routines. The control 
processor executes tokens that change the 
flow of control, such as REPEAT, DO, 
GOTO, and IF. A block diagram of a 
typical direct-execution machine is shown 
in Figure 12. This type of direct-execution 
machine is sometimes referred to as the
University of Maryland approach. 

In this section, we examine only one 
direct execution architecture in detail 
because of its limited commercial useful¬ 
ness given the current level of technology. 
This architecture is based on the Universi¬ 
ty of Maryland approach. Some other 
work deserves mention, too. 29, 30

Pasdec. The Pasdec project (for Pascal 
interactive Direct Execution Computer) 
was developed at the University of Tsukuba,
Japan. 31 Pasdec consists of a complete
system involving a terminal, I/O processor,
interactive processor, and language
processor. The motivation for the Pasdec 
project was to increase programmer prod¬ 
uctivity by eliminating the conceptual 
gap 1 between the programming language 
and the target architecture. 

Pasdec executes a small subset of stan¬ 
dard Pascal called Tiny-Pascal. Tiny- 
Pascal has three control constructs, two 
data types, two data structures, and one 
data flow operation. The outstanding 
feature of Pasdec is that all interaction 
with the computer occurs at the level of the 
high-level language. For example, the user 
enters a program under the interactive 
direction of an editor subprocessor. The 
processor immediately detects incorrect 
syntax and allows the user to correct the 
error. This avoids the flood of meaningless 
error messages often encountered on con¬ 
ventional computers. 

Debugging of programs also occurs at 
the level of the high-level language. The
user can single-step the program at any
desired granularity and immediately ob¬ 
serve the program’s effect on the data. 
Debugging is under the control of the 
debugger subprocessor. The debugger and 
the editor together form the interactive 
processor. 

Actual execution of the code is handled 
by the language processor, which consists 
of a control subprocessor, a data sub¬ 
processor, a lexical subprocessor and a 
driver subprocessor. The driver subpro¬ 
cessor decides whether the control pro¬ 
cessor or the data processor is to be ac¬ 
tivated, depending on whether the token is 
a control statement (IF, WHILE, etc.) or 
a data-related statement (declaration, as¬ 
signment, etc.). The lexical subprocessor 
obtains the next token from the program 
memory. 
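
The division of labor among the subprocessors can be sketched as a simple dispatch. This is a loose illustration: the function names, the whitespace tokenizer, and the keyword set are assumptions (the text names only IF and WHILE as control statements).

```python
# IF and WHILE are named in the text; REPEAT is an assumed addition.
CONTROL_KEYWORDS = {"IF", "WHILE", "REPEAT"}


def lexical_subprocessor(statement):
    """Yield tokens one at a time (whitespace splitting stands in for real lexing)."""
    yield from statement.split()


def driver_subprocessor(statement):
    """Decide which subprocessor to activate for a statement."""
    first_token = next(lexical_subprocessor(statement), "")
    return "control" if first_token.upper() in CONTROL_KEYWORDS else "data"
```

Under these assumptions, `driver_subprocessor("WHILE i < 10 DO")` selects the control subprocessor, while an assignment such as `x := x + 1` goes to the data subprocessor.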


The common thread among these approaches
to HLL computer architecture
is their attempt to reduce one
or more dimensions of the conceptual gap. 
RISCs reduce the conceptual gap by tailor¬ 
ing their architectures and instruction sets 
to efficiently execute the kind of code pro¬ 
duced by compilers. The special features 
of a reduced architecture include large on- 
chip register files, easily decoded short in¬ 
structions, an efficient data path between 
the registers and the ALU, and a compiler. 
These features, together with a smart op¬ 
timizer, allow the reduced architecture to 
execute compiled high-level language code 
quickly. In addition, reduced architectures 
require a shorter design and fabrication 
time than the more complex architectures. 
This allows the designers to use newer 
technology for reduced architecture im¬ 
plementations. 

CISCs, on the other hand, migrate 
some or much of the complexity into the 
hardware or firmware. This reduces the 
conceptual gap by making the machine 
instructions look more or less like the 
constructs of one or more HLLs. The 
language-directed microcomputers, for 
example, typically have instructions spe¬ 
cially tailored to execute various high- 
level language constructs. The microcode 
to execute these instructions resides in a 
large control store on the chip. The high 
performance of these machines stems 
from the fast access to the control store 
memory (as opposed to the reduced architectures,
where the program memory
can be considered a control store).

The question remains: How do we best 
invest a given area of a silicon chip? For 
silicon VLSI, the answer remains unclear. 
Certainly, the question of whether to use a 
reduced or complex architecture depends 
on the nature of the application and 
technology in question. For GaAs VLSI 
implementations, however, there is only 
one choice: the reduced architecture. Yield 
problems limit GaAs devices to a transis¬ 
tor count of about 30,000, 32 and the only 
architectures we have discussed here that 
have fewer than 30,000 transistors are the 
reduced architectures. (Of course, trans¬ 
istor count is just one of the relevant 
design parameters for GaAs. 32 ) 

The authors will be happy to supply in¬ 
terested readers with an extended reading 
list on the subject of HLL architecture. 
Contact Milutinovic at Purdue University, 
Mailbox 62, School of Electrical Engineer¬ 
ing, West Lafayette, IN 47907. □ 

Zemanek receives
Computer Pioneer medal 
at NCC 

Heinz Zemanek, inventor of the 
Mailuefterl fully transistorized computer, 
received the Computer Pioneer medal 
from the IEEE Computer Society during 
ceremonies held at the National Com¬ 
puter Conference June 18. 

Zemanek launched his career in com¬ 
puter technology in Vienna shortly after 
the close of World War II. In 1949, with 
only used PTT relays at his disposal, he 
built a relay computer model for Vienna 
University of Technology. Later, with a 
$1000 grant he began work on what 
became in 1959 the Mailuefterl (“May 
Breeze”) transistorized computer. Zema¬ 
nek worked at the IBM Laboratory in 
Vienna from 1961 until 1976, when he 
was appointed an IBM Fellow. There his 
group established the formal definition
of the syntax and semantics of the PL/I
language. 

Zemanek was founding president of 
the Austrian Computer Society and has 
been active for many years in the Inter¬ 
national Federation for Information 
Processing. His honors include the
Leonardo da Vinci Medal of the European
Federation for Engineering Education.
He was elected an IEEE Fellow in
1970.

Heinz Zemanek (center) receives the Computer Pioneer medal from
Computer Society President Roy Russo (left) and Awards Committee
representative Marshall Yovits (right).

The Computer Pioneer medal honors 
individuals who have made significant 
contributions to early concepts and de¬ 
velopments in the electronic computer 
field. 


Teradata wins AFIPS Product of the Year award 


Teradata Corporation’s DBC/1012 
Database Computer System received the 
first annual Product of the Year award 
in the systems category from the Amer¬ 
ican Federation of Information Process¬ 
ing Societies and Fortune magazine. The 
award was presented at the National 
Computer Conference in June. 

The Product of the Year award was 
established by AFIPS to honor outstand¬ 
ing computer products in three catego¬ 
ries: hardware, software, and systems. In 
the hardware category, Plus Develop¬ 
ment Corporation garnered the top 
award for Hardcard, a 10M-byte hard
disk drive that fits into a single IBM PC 
expansion slot. In the software category, 
Telos Software Products won for 
Business Filevision, a visually oriented 
database management system for the 
Apple Macintosh. 

The DBC/1012 is a parallel processing 
system that employs between six and 
1024 microprocessors, each capable of 
independently accessing a common 
database by means of the company’s 
Ynet intelligent network. The system 
uses a version of the SQL database query 
language. In June, the company shipped 
a 128-processor version of the DBC/1012 
to Citibank of New York. Teradata claims
that this system, at 130 MIPS throughput,
is the most powerful ever built
for a business application.

AFIPS President Stephen S. Yau (left) presents the 1986 NCC Product
of the Year award to Jack E. Shemer (right), CEO of Teradata
Corporation. Shemer is a former technical editor of Computer magazine.

Teradata was founded in 1979 by Jack 
E. Shemer, who has served since then as 
the company’s chief executive officer. 
Shemer was technical editor of Com¬ 
puter magazine from 1973 to 1976, and 
served from 1977 to 1978 as a member 
of the Computer Society’s Board of 
Governors. 

This year’s award-winning products 
were chosen from a field of 72 entries 
from 58 US companies. Fortune, a 
biweekly business publication of Time, 
Inc., sponsors the awards. 


Fiber optic chip runs at 400M bps 


Scientists at IBM facilities in East 
Fishkill and Yorktown Heights, New 
York, have developed an experimental 
fiber optic computer chip capable of 
receiving data from I/O devices at a rate 
of up to 400 million bits per second. 

The new chip, which measures 4.70 
mm square, uses fiber optic lines to 
receive laser light pulses that convey in¬ 
formation between a central processing 
unit and its peripherals. A separate 
photo detector converts the optical 
pulses into electrical signals, which in
turn are converted by a high-speed
receiver circuit on board the chip into 
standard digital logic signals. At 400M 
bps, the chip could conceivably receive 
the entire text of a 20-volume encyclo¬ 
pedia in less than three seconds. 
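
The encyclopedia figure can be checked with one line of arithmetic. The sketch assumes 8 bits per byte and ignores any framing or protocol overhead, neither of which the announcement specifies; the function name is hypothetical.

```python
def bytes_received(bits_per_second, seconds):
    """Payload in bytes delivered at a given bit rate (no overhead assumed)."""
    return bits_per_second * seconds // 8
```

At 400 million bits per second, three seconds corresponds to 150 million bytes, so the claim implies an encyclopedia text of somewhat less than that size.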

Fiber optic transmission offers greater 
data integrity and security than standard 
data transmission over twisted-pair elec¬ 
trical wires. Interference in a fiber optic 
line transmission interrupts the flow of 
data, thus warning the system of a line 
failure or an attempted security breach. 

NEW PRODUCTS


Editor: Demetrios Michalopoulos/California State University, Fullerton

LabVIEW employs “graphical diagramming”


National Instruments Corp. has
announced LabVIEW (for Laboratory
Virtual Instrument Engineering Workbench),
which the company claims incorporates
a new programming technology
called graphical diagramming. LabVIEW
runs on the Macintosh computer. It provides
an environment for developing
scientific applications that integrate
instrument control, data acquisition, data
analysis, data entry, data management,
and report generation.

According to the company, LabVIEW
uses the intuitive concepts of front
panels and block diagrams as tools for
software development. The user views
his or her applications as a model virtual
instrument and creates a front panel to
interact with it. Next, he or she designs a
block diagram showing the flow of data
from input controls or terminals, through
internal processing functions to output
display terminals. According to the company,
the block diagram is the executable
virtual instrument, analogous to an
executable subroutine or program in the
traditional programming environment.

LabVIEW costs $1995. For more
information, contact National Instruments
Corp., 12109 Technology Blvd., Austin,
TX 78727-6204; (800) 531-4742. In
Texas, (800) 433-3488.


Intel has single-chip graphics coprocessor


Intel Corp. has announced the 82786
graphics coprocessor, which supports
two independent, on-chip processors for
manipulating graphics and text while
executing multiple windows. The 82786
operates independently of the host central
processing unit.

According to the company, Intel
designed the 82786 to work with all Intel
microprocessors, as a replacement for
subsystems and boards that use discrete
components and software for implementing
graphics functions. The 82786
uses the same CHMOS-III (complementary
high-performance metal-oxide
semiconductor) process as the 80386
microprocessor.

The 82786 graphics coprocessor is
available in sample quantities for less
than $100 in quantities of 1000. Volume
production is scheduled for the fourth
quarter of 1986. For more information,
contact Intel Corp., Literature Dept.
W-300, 3065 Bowers Ave., Santa Clara,
CA 95051.


System building blocks for OEMs employ RISC architecture


MIPS Computer Systems, Inc. has
announced that it will supply system
building blocks based on its proprietary
32-bit reduced instruction set computer
(RISC) architecture to original equipment
manufacturers (OEMs). The R2000
Series includes component kits, CPU
boards, and a development system for
design and software engineers.

The R2100, R2300, and R2600 CPU
boards reputedly provide performance at
3, 5, and 8 MIPS, respectively. Each
board contains the company’s CPU
chip, instruction and data caches, separate
buses for I/O and computation,
memory interface circuitry, and floating
point support. (A floating point accelerator
chip will be available in sample
quantities in 1987.) The boards are
volume priced at $3170 (R2100), $4775
(R2300), and $6420 (R2600). The R2350
memory board provides 4M bytes of
memory for the R2300 and R2600 for a
cost of $3300.

The R2065/12 and R2065/16 component
kits consist of a 12.5- or 16.7-MHz
CPU chip reputedly capable of 8
or 10 MIPS sustained performance,
respectively; the floating point accelerator
chip; a write buffer consisting of
four gate array chips; and a binary copy
of the UMIPS operating system, utilities,
and C optimizing compiler. Prices are
$1750 and $2250, respectively, for the
12.5-MHz and 16.7-MHz versions.

The M/500 development system is
configured around the R2300 CPU
board in a 12-slot VMEbus cardcage.
The base configuration includes 4M
bytes of main memory, a 337M-byte disk
drive, a 60M-byte quarter-inch cartridge
tape drive, an Ethernet controller, and
eight serial I/O ports. The basic system
costs $59,900 and includes the UMIPS
operating system.

For more information, contact MIPS
Computer Systems, Inc., 930 Arques
Ave., Sunnyvale, CA 94086; (408)
720-1700.


Teamwork/SD automates
software design

Cadre Technologies Inc. has announced
Teamwork/SD, the second
computer-aided software engineering
(CASE) product in the company’s Teamwork
family of automated software development
tools. Teamwork/SD provides
an environment for automated support
of the design phase of software development.
The program runs on workstation-based
supermicrocomputers, including
the Apollo Domain, IBM RT-PC,
and Sun Microsystems. According to the
company, it supports the structured
philosophy of decomposing interrelated
system components into manageable
tasks while minimizing the interfaces between
components.

Graphic modeling techniques support
the development of structure charts. An
editing system supports modules, invocations,
couples, and connectors. Software
engineers can enter code, pseudocode, or
other textual descriptions. Teamwork/SD
also features syntax and completeness
checking for either a single sheet or
an entire structure chart.

Teamwork/SD is designed for use
alone or with the company’s Teamwork/SA,
an automated structured
analysis tool. A data dictionary shared
between the two products permits software
engineers using Teamwork/SA to
use the same data in Teamwork/SD.

The software-only price is $8900. Current
users of Teamwork/SA can purchase
Teamwork/SD for $3600. A combined
package costs $12,500. As a
turnkey system, prices start at $17,300.
For more information, contact Cadre
Technologies Inc., 222 Richmond St.,
Providence, RI 02903; (401) 351-5950.


Honeywell brings out new minis, OS


Honeywell Inc. has announced the 
first models of the DPS 6 Plus family of 
32-bit, midrange minicomputers, which 
run an expanded operating system called 
the Honeywell Virtual System (HVS) 6 
Plus. According to the company, the 
family will be the foundation for depart¬ 
mental-level computing in Honeywell’s 
Office Network Exchange architecture. 

The systems include shadow proces¬ 
sors for data integrity, demand paging 
with segmentation, and a local area con¬ 
troller subsystem. The new models will 
be available in two basic configurations: 
Model 410, with a 16-slot chassis, and 
Model 420, with a 32-slot chassis. Model 
410 has 1 to 4 processors, 4-16M bytes of 
memory, 16 slots, 1 LAN controller, 64 
communications ports, and 3 peripheral 
ports. Model 420 has 1 to 4 processors, 
8-64M bytes of memory, 32 slots, 2 LAN 
controllers, 160 communications ports, 
and 7 peripheral ports. 

System prices range from $90,000 for
an eight-user, one-processor Model 410
with cartridge tape, laser printer, and
142M-byte disk, to $410,000 for a quad-processor
420 for 60 users with 12M
bytes of main memory, 1.5G bytes of
mass storage, 10 Model 34 printers, and
a tape drive. Price includes the HVS 6
Plus operating system. All systems will
be available in September 1986.

Honeywell’s Office Network Exchange
(ONE) Plus runs on the DPS 6 Plus
family of minicomputers. The modular,
integrated departmental software system
includes communications, multimedia
document creation, and intersystem
database access. Modules available in
September are electronic mail, document
processing, time management, electronic
spreadsheet, list processing, and asynchronous
communications. Modules
available in December are advanced document
processing, a document library, a
compound document feature, and a departmental
information base. Prices vary.

For more information, contact
Honeywell Inc., 300 Concord Rd., Billerica,
MA 01821; (617) 671-2744.


Sharp adds AT-compatible

Sharp Electronics Corp. has announced
the PC-7500, an IBM PC-AT-compatible
desktop model computer.

The standard PC-7500 features an Intel
80286 processor running at either 6-MHz
or 8-MHz clock speed, a combination of
double-density or high-density half-height
floppy disk drives, and 512K bytes
of RAM, expandable to either 640K
bytes or 1M byte on the main board,
with 16M bytes addressable on expansion
boards.

Other standard features include a
compatible keyboard with PC and XT
software interfaces, a 16-bit hard disk
controller board, serial and parallel ports
on the main board, and a socket for an
80287 math coprocessor. MS-DOS 3.1
and GW-Basic 3.1 software are included.

The PC-7500 costs $2995. For more
information, contact Systems Division,
Sharp Electronics Corp., Sharp Plaza,
Mahwah, NJ 07430; (201) 529-8971.

Sharp has also announced the 
PC-1600, a handheld computer that 
features random access floppy disk 
storage. The PC-1600 has an optional 
1.5-inch micro floppy disk drive with 
128K bytes of mass storage. The computer
is software-compatible with
Sharp’s PC-1500A. Its microprocessor 
is a Z-80A-like proprietary design. 

According to the company, the
PC-1600 measures 3.5 by eight inches
and weighs about one pound. The display
measures four lines by 26 characters.
The standard configuration includes
16K bytes of RAM, expandable to 80K
bytes. The unit also has a fiber optics
interface, an analog input, and an
RS-232-C serial port.

The PC-1600 alone costs $345. The
CE-1600F micro floppy drive costs $210,
the CE-1600P four-color pen plotter/
cassette interface costs $315, and the
CE-1600M 32K-byte RAM module costs
$155. Contact Systems Division, Sharp
Electronics Corp.


Image processor reads all type fonts with no preprocessing


The Palantir Corp. has announced
its Palantir Compound Document Processor,
or CDP, an optical character
reader/image processor the company
claims is capable of capturing many
different page layouts and type fonts, all
without preprocessing. Its proprietary
character recognition algorithms are
contained in a computing engine that
combines five Motorola 68000 microprocessors
with custom-designed ICs and a
parallel-processing computer architecture.
The computing engine turns text or
images into ASCII computer code or
bit-mapped raster images processible and
storable by a host computer.

According to the company, the product
automatically adapts to each page,
differentiates between textual information
and graphics, locates all text between
6- and 28-point size, and extracts
the text and images. Images are scanned
at a resolution of 300 dots per inch. This
image data can be compressed and transmitted
according to CCITT Group 3 and
4 facsimile standards. Text can be
retrieved without accompanying images,
or vice versa.

CDP communicates with host computers
or networks through standard
serial RS-232 and optional Multibus and
Ethernet interfaces. It comes with a
series of Unix- and PC-DOS-based packages
for file editing and correction and
forms processing, as well as a set of libraries
and utilities for integrating it into users’
systems.

The Compound Document Processor
is priced at $39,550. Contact The Palantir
Corp., 2500 Augustine Dr., Santa
Clara, CA; (408) 986-8006.


Development system runs
at 16.7 MHz

Force Computers, Inc., has announced
the miniForce-2P21, a real-time
32-bit multiuser hardware and software
development system that reputedly runs
at 16.7 MHz without wait-states. It provides
a VMEbus-based open system architecture
for single- or multiuser development.
The 2P21 contains the CPU-21
with the 68020 CPU and the 68881
floating-point coprocessor and employs
the PDOS operating system.

The 2P21 also features 512K-byte
zero-wait-state SRAM, a 51M-byte Winchester
disk, a 1M-byte floppy disk, two
RS-232 serial ports, six VMEbus expansion
slots, a CRT editor, and 68000
macro assembler. C, Fortran-77, and
Pascal are available upon request.

The miniForce-2P21 costs $15,990.
Contact Force Computers, Inc., 727
University Ave., Los Gatos, CA 95030;
(408) 354-3410.


IBM brings out new systems, RT enhancements


IBM Corp. has announced three new
models of the IBM System/36 computer,
six new models of the System/38 computer,
a touch-screen display called InfoWindow,
and a series of enhancements
for the IBM RT-PC.

The System/36 high-end 5360 model 
D offers a maximum of 7M bytes of 
main memory. The System/36 mid-range
5362 models B and C provide a maximum
of 2M bytes of main memory. The
model C supports the IBM 9332 direct 
access storage device (DASD). Existing 
models of the 5360 are field upgradeable 
to the model D; existing models of the 
5362 are field upgradeable to the models 
B and C. All System/36 processors will 
be available in the fourth quarter of 1986 
except the 5360 model D, available in 
the first quarter of 1987. Prices for the 
5360 model D range from $67,850 to 
$106,850. Prices for the 5362 models B 
and C range from $17,625 to $27,625. 
The 9332 DASD costs $8500 per 200M 
bytes of storage. It will be available in 
the fourth quarter of 1986. 


The six new IBM System/38 processors,
models 100, 200, 300, 400, 600, and
700, replace the current models 4, 6, 18,
20, and 40. With the exception of model
100, all models can be upgraded to the
new models. Purchase prices range from
$37,500 to $385,490.

The IBM InfoWindow features a
touch-screen display. Presentation material
can be stored on a PC or videodisc
and is seen and heard in any combination
of computer graphics, full-motion
or still video, and audio. The system
consists of an intelligent terminal with its
own microprocessor. The display controls
audio, video, and auxiliary functions.
Other hardware includes an IBM
PC, PC-XT, or PC-AT with the enhanced
graphics adapter card. The
adapter card enables the text and
graphics stored on the PC to be superimposed
over the video stored on the
videodisc player. The videodisc player is
connected to the display, not the PC.
Other features include RS-232-C and
IEEE 488 interfaces, support for two
videodisc players, two integrated audio
speakers, and a voice synthesis chip. The
IBM InfoWindow display costs $4195.
The IBM Enhanced Graphics Adapter
Jumper Card costs $40.

IBM RT-PC enhancements include
three adapter cards. The baseband
adapter card ($850) permits the RT-PC
to be connected to Ethernet local area
networks. It is mounted internally and
plugs into an 8- or 16-bit slot. The
Token-Ring Network adapter card
($1095) connects the RT-PC to IBM’s
Token-Ring Network. The adapter contains
a microprocessor with microcode
that lets users write their own application
programs. The multiprotocol communications
adapter card ($3400) provides
telecommunications capability.

For more information on any of these
products, contact International Business
Machines Corp., Information Systems
Group, 900 King St., Rye Brook, NY
10573; (914) 934-4488.


Apollo introduces CASE
tools

Apollo Computer Inc. has announced
the Domain Performance Analysis Kit,
the Domain/PAK. This set of computer-aided
software engineering tools reputedly
helps software developers enhance the
execution speeds of their programs.

Domain/PAK runs on all Apollo
operating systems, including Domain/IX.
It consists of three tools: the
Domain Performance Analysis Tool
(DPAT), the Display Process Status
(DPST), and the Histogram Program
Counter (HPC). The DPAT refines performance
data to the procedure level by
profiling processing time and page faults
of each procedure in a program. It continuously
displays a list of the most active
procedures and current call stack.
Matching filters, procedure grouping,
and hierarchical reports allow users to
identify performance problems. The
DPST explores how an application
shares system level resources with
concurrently running processes. The
HPC shows execution time within a
given procedure, with resolution
capabilities down to a single line of code.

Domain/PAK costs $250 per node,
$980 per site. For further information,
contact Apollo Computer Inc., 330
Billerica Rd., Chelmsford, MA 01824;
(617) 256-6600.


5010 laser printer relies on
cartridges

Genicom Corp. has announced the
5010 laser printer. The printer uses a
laser engine that includes an 80186 and a
proprietary image processor. The laser
engine includes a two-part toner/developer
system and an organic photoconductor
drum cartridge. The drum,
developer, and toner cartridges can be
replaced separately.

Two “personality” cartridges are
available: one emulates the IBM Graphics
Printer and the Diablo 630; the other
incorporates the Hewlett-Packard Laser
Jet protocol. The company also plans to
introduce additional personality cartridges.
Original equipment manufacturers
have the option (based on volume
ordered) of specifying custom personality
cartridges.

Font selection also relies on cartridges.
Up to four font cartridges can be on-line
by inserting them into the available slots.
According to the company, users will be
able to choose from 1,500 fonts ported
from the Bit Stream and Genicom character
libraries.

The 5010 laser printer sells for $3495.
For more information, contact Thayton
Traughber, Genicom Corp., Genicom
Dr., Waynesboro, VA 22980; (703)
949-1188.


Portable PC features backlit
LCD screen 

Zenith Data Systems has announced
the Z-181 Portable PC with a full-size,
backlit liquid crystal display screen;
shock-mounted 3.5-inch disk drives; and
640K bytes of RAM. The screen features
a 25-line-by-80-character display with a
true aspect ratio (without compression of
images). The contrast ratio of 12-to-1
can be adjusted to fit varied lighting.

The Z-181 can be used with an external
RGB color or monochrome composite
monitor. The computer is IBM
PC-XT compatible and uses an Intel
80C88 processor running at 4.77 MHz
clock speed. It includes a socket for
installation of an 8087 numeric coprocessor.
An interface for an external
5.25-inch disk drive permits transfer of
information to the smaller drives. An
RS-232 serial port permits connection to
printers, modems, and other I/O devices.
A parallel port is standard. The
computer also comes with an AC
adapter/charger, real-time clock, software,
and the MS-DOS 3.2 operating
system. Options include an internal
modem (operating at 300 or 1200 baud)
and a bar-code reader.

The system costs $2399. Contact
Zenith Data Systems, 1000 Milwaukee
Ave., Glenview, IL 60025; (312) 391-8949.


Recent Microsystem Announcements


For more information, circle the appropriate RS No. on the Reader Service Card at the back of the magazine. 

MANUFACTURER AND MODEL; FUNCTION; RS NO.; COMMENTS


Baker & Rabinowitz, Inc., Pax System 8632. Operating system. RS No. 70.
This software development system is a multitasking and real-time executive
designed for the IBM PC family. Individual tasks can be written and debugged
under DOS, then ported to Pax. Price: $149.95; $24.95 for documentation only.
Baker & Rabinowitz, Inc., 3869 Kilbourne Ave., Cincinnati, OH 45209-1814; (513)
871-[illegible].

Data Translation, PC Semper. Operating environment. RS No. 71.
Designed for the IBM PC family, PC Semper provides more than 100
image-processing and analysis functions implemented by means of IEEE 32-bit
floating-point arithmetic. Two versions available: one ($1995) supports Data
Translation’s DT2851 frame grabber (512 x 512 x 8) and the other ($1495)
supports the company’s DT2803 frame grabber (256 x 256 x 6). Data
Translation, Inc., 100 Locke Dr., Marlboro, MA 01752; (617) 481-3700.

Digital Equipment Corp., VAXstation II/RC. Workstation. RS No. 72.
This workstation, which is based on the MicroVAX II processor, includes a
19-inch monochrome monitor, 71M-byte disk, disk controller, Ethernet interface,
and 95M-byte streaming tape drive. It is supported on both MicroVMS and
Ultrix-32M. Available in two versions: fixed package systems of 3M-byte
($14,995) and 5M-byte ($16,995) configurations. Digital Equipment Corp.,
Maynard, MA 01754-2571.

Express Systems, Inc., Hard DiskCards. Hard disk cards. RS No. 73.
This family of seven cards is designed for the IBM PC and compatibles and
ranges from the 20M-byte AT Backup DiskCard ($449) for the IBM PC AT to the
60M-byte Hard DiskCard ($1095). They vary in performance as to combinations
of 60 msec or 80 msec average access speeds at 5M or 7.5M bps transfer rates.
Express Systems Inc., 1254 Remington Rd., Schaumburg, IL 60195; (800)
341-7549, ext. 3600 or (312) 882-7733, ext. 3600 (in Illinois).

Graphic Software Systems, Inc., GSS*GKS. Graphical Kernel System development
package. RS No. 74.
GSS*GKS is designed for IBM RT PC graphics programmers. It makes possible
porting of GKS-based applications from other systems to the RT because it
implements the ANSI standard Graphical Kernel System to the full “level 2b”
specification, plus “level c” sampled input. Price: $795. Graphic Software
Systems, Inc., 9590 SW Gemini Dr., PO Box 4900, Beaverton, OR 97005; (503)
641-2200.

Hewlett-Packard Co., HP Series 300 DOS coprocessor. Coprocessor interface
card. RS No. 75.
These cards provide IBM PC AT/MS-DOS software compatibility to the HP 9000
Series 300 technical workstation. Prices vary depending on model chosen.
Hewlett-Packard, PO Box 10301, Palo Alto, CA 94303-0890; (415) 857-1501.

IDEAssociates, Inc., Diskit 2 Plus. Winchester disk drive. RS No. 76.
This external Winchester disk drive for the IBM PC family features
hardware-based encryption and removable cartridges. Diskit 2 Plus supports the
NBS-certified Data Encryption Standard. Price: $3595. IDEAssociates Inc., 29
Dunham Rd., Billerica, MA 01821; (617) 663-6878.

Intel Corp., iPAT Performance Analysis Tool. Software analysis tool. RS No. 77.
iPAT is hosted on the IBM PC XT and PC AT, and on Intel development systems.
It supports operation with Intel’s I²ICE Integrated In-Circuit Emulator for
analysis of software based on 8086, 8088, 80186, 80188, and 80286
microprocessors. Price: $9995. Intel Corp., Literature Dept. W-303, 3065 Bowers
Ave., Santa Clara, CA 95051.

Micro Focus, Compact Level II Cobol/ET, Animator, Forms-2, Upgrade III. Cobol
development toolkit. RS No. 78.
The four named products are available in IBM RT PC versions. Compact Level II
Cobol/ET ($2000) is an ANSI 74 Cobol compiler; Animator ($1200) is a tool for
migrating applications from one environment to another; Forms-2 ($400) is a
screen painter and prototyping tool; Upgrade III ($400) is a preprocessor and
migration aid. The four can be purchased as a bundled product. Micro Focus,
2465 E. Bayshore Rd., Palo Alto, CA 94303.

Toshiba America, Inc., T1100 Plus. Portable personal computer. RS No. 79.
An IBM-compatible laptop, the T1100 Plus is equipped with two 3.5-inch
720-kilobyte diskette drives, 256K-byte ($1999) or 640K-byte ($2399) memory, and
parallel and serial ports. It runs on an 80C86 16-bit processor at 7.16 MHz clock
speed. Toshiba America, Inc., Information Systems Division, 2441 Michelle Dr.,
Tustin, CA 92680; (714) 730-5000.


Recent IC Announcements


For more information, circle the appropriate RS No. on the Reader Service Card at the back of the magazine. 


MANUFACTURER AND MODEL; FUNCTION; RS NO.; COMMENTS


Fairchild Semiconductor Corp., FGE6300. ECL gate array. RS No. 90.
An ECL gate array with 6300 equivalent gates, internal propagation delays of
225 ps, and 560 internal cells. Meets 565 MHz telecommunications system
frequency requirements. Packaged in a multilayer ceramic 301-pin pin grid array
(PGA). Price: $595 (100’s) for commercial version. Fairchild Information Center,
Fairchild Semiconductor, Gate Array Div., 1801 McCarthy Blvd., Milpitas, CA
95035; (800) 554-4443.

Hybrid Systems Corp., HS7528, HS7628. DACs.
Dual and quad digital-to-analog converters incorporating multiple 8-bit D/A
converters, two and four to a chip, respectively. Both offer four-quadrant
multiplication with a separate reference input and feedback resistor for each DAC.
Price: From $5.80 to $29.30 (100’s) for the HS7528; from $8.92 to $44.80 (100’s)
for the HS7628. Hybrid Systems Corp., 22 Linnell Circle, Suburban Industrial
Park, Billerica, MA 01821; (617) 667-8700.

Honeywell Inc., HDAC51400; HDAC10180 (A and B), HDAC10181 (A and B). DACs.
RS No. 92.
The 400-MWPS HDAC51400 targets raster graphic video display. It is available in
a 24-lead Cerdip, ceramic-sidebrazed DIP, and LCC. The HDAC10180A and
HDAC10181A operate at 275 MWPS; the HDAC10180B and HDAC10181B, at 165
MWPS. The HDAC10180, A and B, are pin-compatible with TRW’s TDC1018.
Price: $58.31 (1000’s) for the HDAC51400 Cerdip; $42.50 (100’s) for the
HDAC10180A Cerdip; $32.50 (100’s) for the HDAC10180B Cerdip; $43.75 (100’s)
for the HDAC10181A Cerdip; $33.75 (100’s) for the HDAC10181B Cerdip.
Marketing Communications Dept., Honeywell Inc., Signal Processing
Technologies, 1150 E. Cheyenne Mtn. Blvd., Colorado Springs, CO 80906; (800)
328-3423.

Raytheon Co., DAC-4888. DAC. RS No. 93.
An 8-bit microprocessor-compatible digital-to-analog converter in a 24-pin DIP.
Includes functions to build a D/A conversion system. Available in three grades,
F and D over the commercial temperature range and B over the military
temperature range. Price: in 100’s, $7.14 for DAC-4888DD; $10.14 for
DAC-4888FD; $20 for DAC-4888BD. Raytheon Co., Semiconductor Div., 350 Ellis St.,
Mountain View, CA 94043; (415) 966-7716.

National Semiconductor Corp., SCL family. Standard cells. RS No. 94.
Include fixed-height cells and functional blocks. Fixed-height cells include over
150 common small-scale integration (SSI) and medium-scale integration (MSI)
logic functions. Functional blocks available include single- and dual-port RAMs;
programmable logic arrays; CMOS and TTL-compatible I/O functions; the 16450
UART; and high-current output drivers. Prices vary. National Semiconductor
Corp., 2900 Semiconductor Dr., PO Box 58090, Santa Clara, CA 95052-8090; (408)
721-5098.

Oki Semiconductor, M6992. DSP IC. RS No. 95.
A 22-bit, CMOS floating-point digital signal processor (DSP) IC available in
engineering samples. Has an instruction cycle time of 100 ns, a clock rate of 40
MHz, throughput of 20 Mflops, and a dynamic range of 480 dB. Currently
available in a 132-pin PGA. Price: $250, with a tooling charge of $4500. Oki
Semiconductor, 650 N. Mary Ave., Sunnyvale, CA 94086; (408) 720-1900.

Standard Microsystems Corp., COM78808; COM1553BLL. UART; controller. RS No. 96.
The COM78808 octal universal asynchronous receiver transmitter consists of
eight UARTs on one chip, eight independent baud-rate generators, and a flag
scanner. Available in gull-wing or straight-lead frame versions. The
COM1553BLL is a Mil-Std-1553B controller in a surface-mount ceramic leadless
chip carrier. It contains the major functions needed to implement Mil-Std-1553B.
Price: $87 for COM78808; $112.15 for COM1553BLL (100’s). Standard
Microsystems Corp., 35 Marcus Blvd., Hauppauge, NY 11788; (516) 273-3100.

Toshiba America, Inc., TC22SC Series. Standard cells. RS No. 97.
The TC22SC Series of standard cells uses Toshiba’s proprietary two-micron
CMOS process and is compatible with the TC21SC standard-cell series and the
TC17G gate array family. Price varies. Contact Toshiba America, Inc., 1220 Midas
Way, Sunnyvale, CA 94086; (408) 733-3223.


NEW LITERATURE


IMC book on electronic mail. Electronic
Mail and Message Handling by Peter
Vervest, distributed by the International
Information Management Congress,
comprehensively surveys major systems,
services, technology, and standardization
issues in this rapidly expanding field.
This 237-page illustrated and indexed
hardbound book also contains an introduction
to the subject and a bibliography.
International Information Management
Congress, PO Box 34404,
Bethesda, MD 20817-0404; (301)
983-1604. All orders for this publication
(designated IMC-175) should be prepaid
in US currency. Surface mail price: $49;
airmail: $60.

Guide for professional business graphics.
Hewlett Packard’s four-color, 16-page A
Personal Guide to Professional Business
Graphics lists procedures for preparing
graphics presentations. Six steps cover
various aspects of design, format, labeling,
and color. All charts and graphs
provide some kind of data comparison.
Hewlett Packard, 3000 Hanover St.,
Palo Alto, CA 94304. Free.

Volume 2 of Enhancing Your Apple
II/IIe. Don Lancaster’s second volume
of this title provides readers with more
ready-to-use software and hardware concepts
of these Apple systems. Readers
will learn to microjustify and proportionally
space Applewriter IIe, eliminate
the Wolfenstein SS, capture source code
for custom modification, and use software
to create a jitter-free screen lock.
Howard W. Sams & Co., Dept. R41,
4300 W. 62nd St., Indianapolis, IN
46268.

The Report Store Complete Catalogue.
This catalogue lists 1,600 technical
reports, books, proceedings, and other
publications related to design and use of
advanced computing technology. It also
describes the company’s literature compilation
process, contract literature
analysis research services, six literature
guides, and distribution channel for tech
reports and conference proceedings. The
Report Store, 910 Massachusetts St.,
Ste. 403-7AK, Lawrence, KS 66044;
$12.50 prepaid, with $12.50 refund for
purchases $25 and above.

American National Standard X3.155-198X.
The Accredited Standards Committee
on Information Processing Systems
is soliciting public review and comment
for a four-month period (July 18,
1986-November 18, 1986) on a draft of
this standard for a 5 1/4-inch rigid disk
removable cartridge. The standard specifies
the general and physical requirements
for interchangeability of the
single-disk cartridge as required to
achieve unrecorded cartridge interchange
between disk storage drives and associated
information-processing systems.
Copies may be obtained from Global
Engineering Documents, Inc.; (800)
854-7179; $20.

Technology impact report. This book
says that the Pentagon’s VHSIC (Very
High Speed IC) development program
will “power the move to submicron IC
technology” and that “there is no question
that VHSIC breakthroughs could
catapult the United States into the lead
over the Japanese in the IC race again.”
It describes the program, Phase 2 of
which has as its goal ICs utilizing
0.5-micron geometries and 100-MHz
minimum clock speeds, and the program’s
potential effects on military applications
and on commercial IC equipment
and device markets. The VHSIC
Program’s Impact on the Commercial IC
Market, Electronic Trend Publications,
10080 N. Wolfe Rd., Suite 372, Cupertino,
CA 95014; (408) 996-7416; $985.

IWS in the workplace. Aimed at information
systems managers, this study
finds that intelligent workstations (IWS)
offer communications capabilities that
are unavailable in microcomputers, as
well as local processing advantages that
terminals cannot provide. It gives advice
on deciding whether the use of IWS will
benefit a company and, if the answer is
affirmative, how to integrate such workstations
into the firm. Intelligent Workstations:
Connecting the End User, Input,
1943 Landings Dr., Mountain View,
CA 94043; (415) 960-3990.

Guide to 1985 Federal R&D. This
volume describes approximately 1200
processes, inventions, software, techniques,
and pieces of equipment developed
by and for federal agencies during
1985. The applied technology included
was chosen for its commercial potential
and/or promising applications to the
fields of computer technology, energy,
electrotechnology, engineering, life
sciences, machinery and tools, manufacturing,
materials, physical sciences, and
testing and instrumentation. Federal
Technology Catalog: A Guide to New
and Practical Technologies 1985, National
Technical Information Service, US
Dept. of Commerce, Springfield, VA
22161; (703) 487-4650; $25.

New Mumps Users’ Group publication.
This first volume of a publication to be
issued quarterly by the Mumps Users’
Group is divided into sections on
melding Mumps with Prolog, the creation
of an expert system in Mumps, techniques
for Mumps programmers, and
Mumps portability. Mumps originated as
the Massachusetts General Hospital
Utility Multiprogramming System; it is
used in more than 10,000 medical, commercial,
and industrial applications
worldwide and is both a programming
language and a data management system.
Artificial Intelligence, Mumps, and
Fifth-Generation Computing, Vol. 1,
Mumps Users’ Group, Suite 510, 4321
Hartwick Rd., College Park, MD 20740;
(301) 779-6555; $7.70.

Factory communication systems newsletter.
The MAPNetter follows hardware
and software developments relating to
the General Motors Manufacturing
Automation Protocol, or MAP, and the
Boeing Company’s Technical Office Protocol,
or TOP, communication protocols
used for integrating multivendor environments
in business offices and factories
engaged in engineering and manufacturing.
Architecture Technology
Corp., PO Box 24344, Minneapolis, MN
55424; (612) 935-2035. US residents,
$372 per year; others, $432.

Single-mode optical fiber discussed and
dispersion measurement therein described.
Single-Mode Fiber Applications:
Questions and Answers (R-36) by Scott
A. Esty answers often-asked questions
about single-mode optical fiber use,
while Multiple-Wavelength System for
Characterizing Dispersion in Single-Mode
Optical Fibers (TR-52) describes a
new system used by Corning Glass
Works for characterizing chromatic
dispersion in single-mode optical fibers.
The system, described in a paper by
Robert Modavis and Walter F. Love of
the company’s R&D division, can be
used for quality assurance and process
feedback in shifted and flattened fibers.
Corning Glass Works,
Telecommunications Products Div., MP-BH-5-1,
Corning, NY 14831; (607)
974-8705; free.

CALL FOR PAPERS


Call for papers for Computer 


Computer magazine seeks articles that
cover the state of the art and important
new developments in computer science,
technology, and applications. Aimed at a
broad audience with diverse interests and
experience, Computer usually publishes
surveys or tutorials that facilitate the
transfer of technology from university to
industry, from research to applications,
and across specialized fields. Submit six
copies of the manuscript, including illustrations,
references, and authors’ biographies,
to the editor-in-chief:


Michael C. Mulder 
Applied Research 
University of Portland 
Portland, OR 97203 
Phone (503) 283-7433 


Computer also seeks special issue proposals
and articles in the following topic areas:

• microprocessor architectures 

• networking and distributed 
computing 

• computer-integrated manufacturing 

• software quality assurance 

• programming environments 

Prospective guest editors and authors 
should submit proposals and articles 
directly to Michael Mulder. 

An authors’ information sheet can be obtained
from Michael Mulder at the above
address or from the IEEE Computer Society
West Coast Office, 10662 Los Vaqueros
Circle, Los Alamitos, CA 90720;
(714) 821-8380.


NCC-87, National Computer Conference 
(AFIPS): June 15-18, 1987, Chicago, Illinois. 
Submit abstracts (approximately 500 words) 
by August 31, 1986, to Margaret Butler, PO 
Box 129, Lemont, IL 60439. 


Second International Conference on
Computers and Applications: June
24-26, 1987, Beijing, China. Long and short
papers are sought. Long papers will be judged
on the basis of a 100-word abstract and the
full text (20 double-spaced pages maximum);
short papers will be judged on the basis of a
100-word abstract and a summary of 1000
words. To submit a paper in either category,
send four copies of both required items by
August 31, 1986, to Oscar N. Garcia, Dept.
of Electrical Engineering and Computer
Science, Room T-637, George Washington
University, 801 22nd St., NW, Washington,
DC 20052; (202) 676-7175 or Zhang Xiaoxiang,
Institute of Computing Technology,
Academia Sinica, PO Box 2704; 100080 Beijing,
China; phone 283131, ext. 556.


Ninth IEEE International Conference
on Software Engineering (ACM):
March 30-April 2, 1987, Monterey, California.
Submit four copies of a paper (6000
words maximum; a full-page figure equals
300 words) that includes a short abstract and
a list of keywords by September 1, 1986, to
Robert Balzer, Information Sciences Institute,
4676 Admiralty Way, Marina del Rey, CA
90291 or Kouichi Kishida, Software Research
Associates, 1-1-1, Hirakawa-cho, Chiyoda-ku,
Tokyo 102, Japan.


IEEE Transactions on Computers:
Papers are sought for three special
issues planned for 1987. The first will be
devoted to parallel and distributed processing
systems. Submit six copies of a paper by
September 1, 1986, to John A. Stankovic, Dept.
of Computer and Information Science,
University of Massachusetts, Amherst, MA
01003; (413) 545-0720. The second will cover
real-time systems. Submit six copies of a
paper by December 1, 1986, to Kang G. Shin,
Dept. of Electrical Engineering and Computer
Science, University of Michigan, Ann Arbor,
MI 48109; (313) 763-0391. The third will be
devoted to supercomputing. Submit six copies
of a paper by February 1, 1987, to H. C.
Torng, School of Electrical Engineering, Cornell
University, Ithaca, NY 14853-5401; (607)
255-5191. (Guidelines for submitting manuscripts
appear on the back cover of every
issue of IEEE Transactions on Computers.)


Calls are listed according to submittal deadlines. Conferences that the Computer Society
participates in or sponsors are indicated by the IEEE Computer Society logo; others
of interest to our readers are also included. For inclusion in “Call for Papers,” submit
information six weeks before the month of publication (e.g., for the November 1986
issue, send information for receipt by September 15, 1986) to COMPUTER, 10662 Los
Vaqueros Circle, Los Alamitos, CA 90720.


Sixth Symposium on Reliability in 
Distributed Software and Database 
Systems (ACM, NASA): March 17-19, 1987, 
Williamsburg, Virginia. Submit four copies of 
complete paper by September 1, 1986, to 
Larry Wittie, Computer Science Dept., State 
University of New York-Stony Brook, Stony
Brook, NY 11794-4400. 


CSC-87, 1987 ACM Computer Science Conference:
February 17-19, 1987, St. Louis,
Missouri. Papers (10 pages maximum) that
are suitable for 20-minute presentations are
sought. Submit five copies by September 5,
1986, to Arlan R. DeKock or George Zobrist,
MCS 325, University of Missouri-Rolla,
Rolla, MO 65401; (314) 341-4492. Short
reports on current research activities are also 
sought (these should be suitable for 12-minute 
presentations and will be judged on the basis 
of a 500-word abstract), as are proposals for 
panel sessions and tutorials. Submit materials 
in these three categories by the same date to 
Arlan DeKock or George Zobrist. 

18th Annual SIGCSE Technical Symposium
(ACM): February 19-20, 1987, St. Louis,
Missouri. Along with a statement of intention
to attend, submit five copies of the complete
paper by September 5, 1986, to Dan C.
St. Clair, Computer Science Dept. UMR,
UMSL Campus, 8001 Natural Bridge Rd., St.
Louis, MO 63121-4499.




Office Automation: Integration, Inter¬ 
connection, and Use of Personal Computers 
(NBS): April 27-29, 1987, Gaithersburg, 
Maryland. Submit four copies of complete 
paper (5000 words maximum) and abstract by 
September 15, 1986, to David M. Choy, IBM 
Almaden Research Center, 650 Harry Rd., 
Dept. K52/803, San Jose, CA 95120-6099. 

IEEE Transactions on Software Engineering:
Articles are sought for a special
issue on software for local area networks.
Submit six copies of the article and an IEEE
copyright release form by September 15,
1986, to Sushil Jajodia, Naval Research
Laboratory, Code 7594, Washington DC
20375; (202) 767-3596 or Satish K. Tripathi,
Dept. of Computer Science, University of
Maryland, College Park, MD 20742; (301)
454-5165.
CAREER OPPORTUNITIES


RATES: $9.50 per line, $95 minimum charge (up 
to ten lines). Average six typeset words per line, 
nine lines per column inch. Add $8 for box num¬ 
ber. Send copy at least six weeks prior to month 
of publication to: Sandra J. Arteaga, Classified 
Advertising, COMPUTER Magazine, 10662 Los 
Vaqueros Circle, Los Alamitos, CA 90720. 


In order to conform to the Age Discrimination in 
Employment Act and to discourage age discrimination,
COMPUTER may reject any advertisement
containing any of these phrases or similar
ones: “...recent college grads...,” “...1-4
years maximum experience...,” “...up to 5
years experience...,” or “...10 years maximum
experience.” COMPUTER reserves the
right to append to any advertisement, without
specific notice to the advertiser, “Experience
ranges are suggested minimum requirements,
not maximums.” COMPUTER assumes that,
since advertisers have been notified of this
policy in advance, they agree that any
experience requirements, whether stated as
ranges or otherwise, will be construed by the 
reader as minimum requirements only. 


MICHIGAN STATE UNIVERSITY 

The Department of Computer Science invites ap¬ 
plications for tenure-track positions at all levels. 
Candidates from all areas of specialization in 
computer science or computer engineering will 
be considered. The department has a special in¬ 
terest in candidates in the areas of programming 
languages, database systems, artificial intelli¬ 
gence and expert systems, robotics, design of 
computer systems and networks, parallel com¬ 
putation, dataflow machines, operating systems 
and computational complexity. Candidates 
should have a Ph.D. in computer science or com¬ 
puter engineering and have a strong interest in 
both research and teaching. Applications will be 
accepted until the positions are filled. 

As a unit within the College of Engineering at 
Michigan State University, Computer Science of¬ 
fers the Bachelor of Science, Master of Science 
and Doctor of Philosophy degrees. Special sup¬ 
port is available from within the college and 
university to initiate research by new faculty 
members. Michigan State University enjoys a 
park-like campus of 2,100 developed acres and 
3,100 acres of experimental farms, outlying 
research facilities and natural areas. The cam¬ 
pus is adjacent to the cities of East Lansing and 
the capital city, Lansing. The Greater Lansing 
area has approximately 250,000 residents. The 
communities have fine school systems and 
place a high value on education. 

Applicants should send a resume and a state¬ 
ment of research and teaching interests to: 

Dr. Anthony S. Wojcik, Chairperson 
Department of Computer Science 
A714 Wells Hall 
Michigan State University 
East Lansing, Michigan 48824-1027 
CSNET: wojcik@mich-state 
Michigan State University is an Equal Oppor¬ 
tunity/Affirmative Action Institution and en¬ 
courages applications from members of ethnic 
minority groups. 


THE UNIVERSITY OF ALBERTA 
Department of Computing Science 

The Department of Computing Science is 
undergoing an extensive expansion in research 
initiatives. Applications are invited for three 
tenure-track positions at the Assistant/Asso¬ 
ciate Professor level. Responsibilities include 
research as well as teaching at the graduate and 
undergraduate levels. Candidates from all areas 
will be considered. Current hardware support
includes an Amdahl 5870, a network of VAX
11/780's, and well equipped mini and micro
computer laboratories for graphics, VLSI, and AI
research. Access to a Cyber 205 is available. 
Salary range is $30,316 to $48,970 and is com¬ 
mensurate with qualifications and experience. 
Send curriculum vitae, names of three refer¬ 
ences, and up to three reprints or papers. New 
Ph.D.'s should also include a copy of their 
transcript. Apply to: Dr. Lee White, Chairman, 
Department of Computing Science, University of 
Alberta, Edmonton, Alberta, T6G 2H1. Applications
will be accepted until August 31, 1986. The
University of Alberta is an equal opportunity 
employer. 


RESEARCH SCIENTIST 

OCLC, Online Computer Library Center, Inc., has 
an opening at its corporate headquarters 
(Columbus, OH metro area) for a Research Scientist.
Appropriate candidates should have an
expertise and intellectual interest in and the ability
to carry out independent research projects in 
one of the following areas: library needs and ser¬ 
vices, library automation, information science, 
artificial intelligence, database systems, user in¬ 
terface, telecommunications. Candidates 
should also have a Ph.D. or equivalent ex¬ 
perience in a discipline relative to the above 
products (such as Computer Science, Informa¬ 
tion Science or Library Science), experience in a 
research environment, and excellent verbal and 
written communication skills. Competitive 
salary and a comprehensive benefits package 
will be offered. For position description and 
company information, contact Bill Wolfe, Tech¬ 
nical Employment Representative, 6565 Frantz 
Rd., Dublin, OH 43017; 1-614-764-6097. An Equal
Opportunity Employer M/F. 


SYRACUSE UNIVERSITY 

The School of Computer and Information 
Science invites applications for visiting posi¬ 
tions at all ranks for one or two terms during 
1986-87. The post will entail graduate and under¬ 
graduate teaching, and conduct of research. A 
Ph.D. or equivalent published research is re¬ 
quired. Research specializations in logic pro¬ 
gramming, artificial intelligence, programming 
methodology, analysis of algorithms, operating 
systems, or language design will be of particular 
interest. 

Resumes, names of referees, and selected 
papers or theses should be sent as soon as 
possible to Professor Kenneth A. Bowen, 313 
Link Hall, Syracuse University, Syracuse, New 
York 13244-1240. 

S.U. is an EO/AA employer. 


ROCHESTER INSTITUTE OF TECHNOLOGY 
Faculty Position in Computer Engineering 

The Department of Computer Engineering an¬ 
ticipates having an excellent tenure track faculty 
position available at the assistant/associate pro¬ 
fessor level beginning in December, 1986. Ap¬ 
plicants will be expected to have a strong in¬ 
terest and demonstrated ability to teach VLSI 
design and the ability to contribute to curricular 
evolution. Research in the areas of VLSI design, 
design automation, computer architecture, 
operating systems, and computer communica¬ 
tions will be encouraged. Computer Engineering 
faculty and students have regular use of at least 
9 large VAX systems, a CALMA VLSI design 
system and numerous workstations. The Depart¬ 
ment of Computer Engineering has recently 
completed a move to brand new facilities. This 
position affords unique opportunities to work in 
the hardware/software interface and to exercise 
significant influence on research, laboratories, 
and curricula. Applicants should have com¬ 
pleted a Ph.D. degree in computer engineering or 
a related area by the time of appointment. Ap¬ 
plicants should submit a resume and arrange for 
three letters of recommendation to be sent to Dr. 
Roy S. Czernikowski, Dept. of Computer Engr.,
Rochester Institute of Technology, Rochester, 
NY 14623. RIT is an equal opportunity affirmative 
action employer. 


COMPUTER SOFTWARE ENGINEER 

Formulates mathematical models of systems 
and sets up and corrects computer systems to 
solve business and management problems; 
maintains larger on-line software systems using 
IBM/370 hardware or similar; designs & develops 
on-line systems for data center management; 
consults with originator of problems to deter¬ 
mine sources & methods of data collection & 
methods of determining values of variables; ex¬ 
amines and studies verbal descriptions of prob¬ 
lems to apply knowledge of scientific discipline 
and define problems; modifies and implements 
existing software for other hardware, including 
Tandem machines; writes, tests, debugs and 
modifies software using IBM/370 assembler & 
TSO TEST command; writes program documen¬ 
tation manuals and user guides; assists sales 
and marketing personnel in demonstrations of 
these software packages; requires minimum 2 
years experience + Bachelor of Science degree 
in Computer Science or Mathematics, on-line 
systems design experience, knowledge, and use 
of TAL, PL/I, and IBM/370 assembler. Job site 
Los Angeles. Salary $3,000/month. Send this ad 
and your resume to job #NOF5755, PO Box 9560, 
Sacramento, CA 95823-0560 not later than 
August 30, 1986. 


PROGRAMMER 

Programmer, Unix systems, wanted to develop 
special user interface programs and graphic 
libraries using LISP language and Unix program¬ 
ming. Job site, Goleta, CA. Bachelor degree in 
Computer Sciences, at least 9 months experi¬ 
ence in Unix systems programming, salary 
$29,000 annually. Send this ad and a resume to 
job #NOF5782, PO Box 9560, Sacramento, CA 
95823-0560 not later than September 1, 1986. 
THE BDM CORPORATION
COMPUTER SCIENTISTS 

The BDM Corporation has openings for Master’s- 
and PhD-level computer scientists to contribute 
to research and development in advanced com¬ 
puting technologies. Current efforts focus on 
supporting all aspects of DARPA's Strategic 
Computing program. These positions will in¬ 
volve selected research tasks as well as analysis 
related to military applications which exploit 
emerging technologies. All candidates should 
have a strong background in theoretical com¬ 
puter science with emphasis on language 
design, compilers, operating systems, and ar¬ 
chitecture. In addition, candidates should have 
specialized knowledge of expert systems, other 
areas of artificial intelligence, parallel process¬ 
ing, optical computing, or microelectronics. 
BDM is a highly diversified professional services 
company with offices throughout the United 
States. These openings are located in Northern 
Virginia. Please send resumes to: P. Bradford 
Sterl, The BDM Corporation, Dept. SCC-886, 
7915 Jones Branch Dr., McLean, Virginia 22102. 
An equal opportunity employer. U.S. Citizenship 
is required. A subsidiary of BDM International, 


FLORIDA ATLANTIC UNIVERSITY

Tenure track faculty positions are available in
August or January in our expanding Computer
Engineering programs. Applicants should have
an earned doctorate in electrical or computer
engineering or in computer science. Research
interests in artificial intelligence, operating
systems, computer architecture, microprocessor
systems and computer networks are particularly
appropriate, although other specialties
will be considered. Florida Atlantic University,
part of the State University System, is located in
the middle of a growing number of high-tech
industries and has close educational and research
ties with many of them. Facilities include
HP64000 and INTEL Development systems,
GOULD, RIDGE and DEC minicomputers, a wide
variety of microcomputers and access to a number
of large machines. Send a resume with the
names of at least three references to Dr. Roger A.
Messenger, Chairman, Department of Electrical
and Computer Engineering, Florida Atlantic
University, Boca Raton, Florida 33431. An equal
opportunity/affirmative action employer.


THE UNIVERSITY OF ADELAIDE

invites applications from both women and men
for the following position of:

Lecturer in Computer Science
Tenurable

(Ref: 1508) in the Department of Computer
Science, which runs undergraduate teaching
and postgraduate research programs. Computer
facilities include a number of VAX/VMS and Unix
systems, and a number of Sun workstations, and
state-of-the-art CAD facilities, all connected by
Ethernet and managed locally.

Applicants should hold a higher degree in Computer
Science, be able to demonstrate research
capability, teaching interest, and be prepared to
supervise research students.

The position is available from January 1, 1987.
Further information concerning the duties of the
position may be obtained from Dr. CD Marlin,
Department of Computer Science, telephone (08)
228 5586 or UUCP:...!munnari!uacomsci.ua.oz!

It is University policy to encourage women to
apply for consideration for appointment to tenurable
academic appointments.

Holders of full-time tenured or tenurable academic
appointments have the opportunity to
take leave without pay on a half-time basis for a
specific period of up to ten years where this is
necessary for the care of children.

Information about the general conditions of appointment
may be obtained from the Senior
Assistant Registrar (Personnel) at the University.
Salary per annum: A$27,233 x 7-A$35,777
Applications, in duplicate, quoting reference
number 1508 and giving full personal particulars
(including whether candidates hold Australian
permanent residency status), details of academic
qualifications and names and addresses of
three referees should reach the Senior Assistant
Registrar (Personnel) at the University of
Adelaide, GPO Box 498, Adelaide, South
Australia, 5001, Telex UNIVAD AA 89141 not later
than September 15, 1986.

The University reserves the right to make enquiries
of any person regarding any candidate’s
suitability for appointment, not to make an appointment
or to appoint by invitation.

The University is an equal opportunity employer.


Faculty Position in
Computer Science

The University of Calgary invites applications for
a faculty position in the Department of Computer
Science. This department has a strong
research program as well as a commitment to
teaching excellence at B.Sc., M.Sc. and Ph.D.
levels. Research is well supported by external
grants, funded chairs, and excellent physical
facilities. Active research areas include:
distributed programming environments; databases;
software prototyping; artificial intelligence;
human computer interaction and
systems design; networks; graphics and animation;
simulation; numerical analysis; and VLSI
design tools, specification and verification.
Plans are underway for major expansions of both
the undergraduate and graduate teaching programs.
The department has at its disposal two
dual and two single processor VAX 11-780’s, a
Xerox Dolphin, 24 Corvus Concepts, extensive
graphics equipment, etc., all connected to 10
Mbit LAN’s. Faculty also have access to the
University Multics (the largest academic Multics
installation in the world) and Cyber systems, including
a Cyber 205 Supercomputer.

Applicants are expected to have a Ph.D. in Computer
Science or equivalent research experience,
as well as some teaching qualifications. Duties
will include both research and teaching at the
undergraduate and graduate level. Salary and
rank are negotiable. In accordance with Canadian
Immigration Requirements, priority is
directed to Canadian citizens and permanent
residents. An application should contain a curriculum
vitae and the names of three referees
and should be sent prior to October 1, 1986 to:

Dr. John Kendall, Head
Department of Computer Science
The University of Calgary
2500 University Drive N.W.
Calgary, Alberta, Canada T2N 1N4
Telephone: (403) 220-5454


CALENDAR


August 1986

Tutorial Week Los Angeles 86, August
18-22, Marina del Rey, California. Contact
Tutorial Week Los Angeles 86, IEEE
Computer Society, 1730 Massachusetts Ave.,
NW, Washington DC 20036-1903; (202)
371-0101; TWX 7108250437 IEEECOMPSO.

15th Annual International Conference
on Parallel Processing (ACM), August
19-22, St. Charles, Illinois. Contact Tse-Yun
Feng, Dept. of Electrical Engineering, EE
East Bldg., Pennsylvania State University,
University Park, PA 16802; (814) 863-1469.

Simulation Conference 10, August 25-28,
Washington DC. Contact Cathy Brown,
CACI, 3344 N. Torrey Pines Court, La Jolla,
CA 92037; (619) 457-9681.

IEEE Workshop on Languages for
Automation, August 27-29, Kent Ridge,
Singapore. Contact Shi-Kuo Chang, Dept. of
Electrical Engineering, Illinois Institute of
Technology, Chicago, IL 60616; (312)
567-3401.


September 1986 


25th Annual Lake Arrowhead Workshop:
Frontiers and Limitations of
High-Performance Computing, September
3-5, Lake Arrowhead, California. Contact D.
A. Giese, AP Labs, Inc., 4411 Morena Blvd., 
Suite 150, San Diego, CA 92117; (619) 
272-8890. 


International Test Conference, September
7-11, Washington DC. Contact
Doris Thomas, PO Box 264, Mount 
Freedom, NJ 07970; (201) 895-5260. 


Tutorial Week Boston 86, September
8-12, Cambridge, Massachusetts. Con¬ 
tact Tutorial Week Boston 86, IEEE Com¬ 
puter Society, 1730 Massachusetts Ave., NW, 
Washington DC 20036-1903; (202) 371-0101; 
TWX 7108250437 IEEECOMPSO. 


Euromicro 86, September 14-18,
Venice, Italy. Contact Jan Wilmink,
p/a TH Twente, Dept. INF, Room A306, PO 
Box 217, 7500 AE Enschede, The Nether¬ 
lands; phone +(31) (53) 338799; telex 44200 
thes. 


Conferences that the Computer Society 
participates in or sponsors are indicated 
by the IEEE Computer Society logo; other 
conferences of interest to our readers are 
also included. For inclusion in Calendar, 
submit information six weeks before the 
month of publication (e.g., for the November 
1986 issue, send information for receipt by 
September 15, 1986) to COMPUTER, 10662
Los Vaqueros Circle, Los Alamitos, CA 
90720. 
NEXT IN COMPUTER


September: How Super Are Supercomputers?

Supercomputer manufacturers like to tout peak- 
performance computation rates as though the 
machines regularly operate at such speeds. But 
as our special report from IBM shows, perfor¬ 
mance really depends on the degree to which 
computational problems can be broken into con¬ 
current tasks. Also in this issue: the ELXSI 6400 
multiprocessor database machine. Coming in 
October: Gallium arsenide microprocessor 
technology. 
Computer Scientists

Few Companies In The World Will 
Challenge You As Much As Lawrence 
Livermore National Laboratory! 

When the topic is research and development, Lawrence Livermore National Laboratory has the unexcelled 
ability to study physical phenomena through the utilization of advanced computer technologies. LLNL drives 
the state-of-the-art in many areas of science. We are able to achieve more because we have some of the best 
people in the field and more resources at our fingertips. In addition to a host of other computers distributed 
throughout the Laboratory, we utilize two large scientific computer centers where some of the most powerful 
computers in the world, like the Cray-XMP, are used to solve scientific problems. 

You can’t be a world leader in Computer Science unless your team is made up of leading Computer Scien¬ 
tists. Using all conceivable computer systems to solve problems in Nuclear Weapons, Magnetic Fusion 
Energy, Laser Fusion and Laser Isotope Separation, we’ll challenge you more than you’ve ever been. Con¬ 
sider the following LLNL opportunities for qualified professionals who hold a BS, MS, or PhD in Computer 
Science or a related discipline: 

Graphics 

As part of our team, you will work with state-of-the-art hardware and software graphics systems to solve 
system and application graphics problems. 

Operating Systems 

Become a member of the team designing and developing a distributed operating system for some of the 
world’s fastest computers integrated with a wide range of other computers. 

Networking 

Be part of the effort to develop a network connecting distributed resources throughout the Laboratory to 
achieve high performance, efficient computer-to-computer communications, resource sharing, and 
distributed computing. 

Real-Time Data Acquisition Systems 

You will work with mini and micro computers which control and acquire data from small and large ex¬ 
periments taking place both on-site and in the field. 

Modeling & Simulation Codes 

Working with scientists you will develop complex modeling and simulation code systems to characterize 
physical phenomena for both military and non-military applications. 


Language Development 


Work with other Computer Scientists on the implementation and support of language translators and 
associated language environment software. 

Scientific Workstation Development 

In this effort, you will be part of a team to develop an intelligent terminal to be used for interactive work and 
communication with other workstations, computer centers and networks. 

We have the tools, experience and resources to offer a rewarding and meaningful career in Computer 
Science. If you have the qualifications and the desire to grow, we have the opportunities and the future. Send 
your resume to: Sue Porter, Professional Employment Division, Lawrence Livermore National 
Laboratory, Dept. KCF81603B, P.O. Box 5510, Livermore, California 94550. 











The IEEE Computer Society’s Tenth International 
Computer Software & Applications Conference



Monday, October 6, 1986

1 PROLOG AND KNOWLEDGE INFORMATION PROCESSING 
■ Doug DeGroot, Quintus Computer Systems 

PROLOG will be introduced and evaluated. Starting with a quick introduction to logic, the seminar continues with a series of
complex PROLOG examples. Unification and resolution are explained. A simple PROLOG interpreter is explored to show how
the control component of PROLOG works. A series of typical application domains of PROLOG is investigated. Upon completion,
the student should be able to begin writing PROLOG programs and develop an appreciation of the language's power, ease
of expression and simplicity of control.

2 TECHNIQUES AND STRATEGIES FOR RESTRUCTURING SOFTWARE 
■ Robert S. Arnold, MITRE Corporation 

Hard-to-change software is devalued software. This seminar is dedicated to preserving an organization's software investment. 
This seminar will: 

• Show how software restructuring can help achieve software maintenance goals. 

• Present over 30 techniques for restructuring software, and show how to make a wise choice among them.

• Show how to calculate when software restructuring will pay, and when it won't. 

• Show how to stop software structure from deteriorating in the first place. 

Tuesday, October 7, 1986

3 NEW PARADIGMS FOR SOFTWARE DEVELOPMENT 
■ William W. Agresti, Computer Sciences Corporation

The software life-cycle "waterfall" model has been criticized recently as not always being a useful model of the development 
process. This tutorial exposes the limitations and assumptions of the life-cycle model. Prototyping, operational specification, and 
transformational implementation are presented as the bases of new paradigms of software development that respond to criti¬ 
cisms of the life-cycle model. Examples will illustrate how these three methodologies are being used today. The tutorial discusses 
an organization's transition from the life-cycle to newer paradigms. 

4 INTERACTIVE SOFTWARE DEVELOPMENT ENVIRONMENTS 
■ Anthony I. Wasserman, University of California 

The state of the art in the use of interactive systems for software development is presented. The seminar examines the general 
considerations of human factors, the history of interactive systems, specific tools and development environments, and the short 
and medium term future of interactive development environments. Particular attention is given to high performance work¬ 
stations based on bit-mapped graphical displays, and to the necessary software development tools for such an environment. A 
distinction is drawn between programming and more general software development environments. 


KEYNOTE & PLENARY
ADDRESSES

Prognostications on Software Productivity
Dr. Winston W. Royce, Director
Lockheed Software Technology Center

Software Productivity
Dr. Howard Yudkin, President
Software Productivity Consortium, Inc.

Distributed Software Factory and C&C Satellite Office
Dr. Kiichi Fujino, Vice President
NEC Corporation

SIX TRACKS

Software Quality
1 minireview, 3 panels, 4 paper sessions

Software Engineering
2 minireviews, 2 panels, 2 paper sessions

Software Requirements
1 minireview, 2 panels, 1 paper session

Development Environments
1 minireview, 2 panels, 3 paper sessions

Software Techniques
1 panel, 6 paper sessions

Knowledge Based Systems
1 minireview, 1 panel, 2 paper sessions


COMPSAC only 
COMPSAC+1 Tutorial 
COMPSAC+2 Tutorial 

1 Tutorial only 

2 Tutorial only 


Member 

$130 

$260 

$360 

$165 

$285 


den: (312) 972-5585. A 
deadline: September 14, 1986.

HOTEL REGISTRATION—Ca 
The Americana Congress Hoi 
520 South Michigan Avenue, Chicago, IL 60605, U.S.A. 
(312) 427-3800. 


$390 

$195 

$315 


$370 

$490 

$250 

$400 


$125 
$180 
$ 90 


te by Sept. 14, 1986: 


Room prices—Mention reservation for COMPSAC 86 

□ Single @ $48 □ Single @ $52 

□ Doubles/Twin @ $58 □ Double @ $62 